Jan G. De Gooijer
Elements of Nonlinear Time Series Analysis and Forecasting
Springer Series in Statistics
Series editors
Peter Bickel, CA, USA
Peter Diggle, Lancaster, UK
Stephen E. Fienberg, Pittsburgh, PA, USA
Ursula Gather, Dortmund, Germany
Ingram Olkin, Stanford, CA, USA
Scott Zeger, Baltimore, MD, USA
More information about this series at http://www.springer.com/series/692
Jan G. De Gooijer
University of Amsterdam
Amsterdam, The Netherlands
Preface

Empirical time series analysis and modeling have been deviating, over the last 40 years or so, from the linear paradigm with the aim of incorporating nonlinear features. Indeed, there are various occasions when subject matter, theory, or data suggests that
a time series is generated by a nonlinear stochastic process. If theory could provide
some understanding of the nonlinear phenomena underlying the data, the modeling
process would be relatively easy, with estimation of the model parameters being all
that is required. However, this option is rarely available in practice. Alternatively,
a particular nonlinear model may be selected, fitted to the data and subjected to a
battery of diagnostic tests to check for features that the model has failed adequately
to approximate. Although this approach corresponds to the usual model selection
strategy in linear time series analysis, it may involve rather more problems than in
the linear case.
One immediate problem is the selection of an appropriate nonlinear model or
method. However, given the wealth of nonlinear time series models now available,
this is a far from easy task. For practical use a good nonlinear model should at least
fulfill the requirement that it is general enough to capture some of the nonlinear
phenomena in the data and, moreover, should have some intuitive appeal. This
implies a systematic account of various aspects of these models and methods.
The Hungarian mathematician John von Neumann once said that the study of nonlinear functions is akin to the study of non-elephants.¹ This remark illustrates
a common problem with nonlinear theory, which in our case is equivalent to non-
linear models/methods: the subject is so vast that it is difficult to develop general
approaches and theories similar to those existing for linear functions/models. Fortu-
nately, over the last two to three decades, the theory and practice of “non-elephants”
has made enormous progress. Indeed, several advancements have taken place in the
nonlinear model development process in order to capture specific nonlinear features
of the underlying data generating process. These features include symptoms such as
¹ A similar remark is credited to the Polish mathematician Stanislaw M. Ulam, saying that using a term like nonlinear science is like referring to the bulk of zoology as the study of non-elephant animals; Campbell, Farmer, Crutchfield, and Jen (1985), “Experimental mathematics: The role of computation in nonlinear science”, Communications of the ACM, 28(4), 374–384.
² Throughout the book, I will use the terms forecast and prediction interchangeably, although not quite precisely. That is, prediction concerns statements about the likely outcome of unobserved events, not necessarily those in the future.
references allows readers to follow up on original sources for more technical details
on different methods. As a further help to facilitate reading, each chapter concludes
with a set of key terms and concepts, and a summary of the main findings.
Real data
It is well known that real data analysis can reduce the gap between theory and
practice. Hence, throughout the book a broad set of empirical time series, originating
from many different scientific fields, will be used to illustrate the main points of the
text. This already starts off in Chapter 1 where I introduce five empirical time series
which will be used as “running” examples throughout the book. In later chapters,
other concrete examples of nonlinear time series analysis will appear. In each case,
I provide some background information about the data so that the general context
becomes clear. It may also help the reader to get a better understanding of specific
nonlinear features in the underlying data generating mechanism.
The first track (Chapters 2, 3, 5 – 8, and 10) mainly includes parametric non-
linear models and techniques for univariate time series analysis. Here, the overall
outline basically follows the iterative cycle of model identification, parameter es-
timation, and model verification by diagnostic checking. In particular, Chapter 2
concentrates on some important nonlinear model classes. Chapter 3 introduces the
concepts of stationarity and invertibility. The material on time-domain linearity
testing (Chapter 5), model estimation and selection (Chapter 6), tests for serial
dependence (Chapter 7), and time-reversibility (Chapter 8) relates to Chapter 2.
Although Chapter 7 is clearly based on nonparametric methods, the proposed test
statistics try to detect structure in “residuals” obtained from fitted parametric mod-
els, and hence its inclusion in this track. If forecasting from parametric univariate
time series models is the objective, Chapter 10 provides a host of methods. As a part
of the entire forecasting process, the chapter also includes methods for the construc-
tion of forecast intervals/regions, and methods for the evaluation and combination
of forecasts.
When sufficient data is available, the flexibility offered by many of the semi-
and nonparametric techniques in the second track may be preferred over parametric
models/methods. A possible starting point of this track is to test for linearity and
Gaussianity through spectral density estimation methods first (Chapter 4). In some
situations, however, a reader can jump directly to specific sections in Chapter 9
which contain extensive material on analyzing nonlinear time series by semi- and
nonparametric methods. Also some sections in Chapter 9 discuss forecasting in a
semi- and nonparametric setting. Finally, both tracks contain chapters on multivari-
ate nonlinear time series analysis (Chapters 11 and 12). The following exhibit gives
a rough depiction of how the two tracks are interrelated.
                          Univariate              Multivariate
Parametric                Chapters 2, 5 – 8, 10   Chapter 11
Semi- and nonparametric   Chapters 4, 9           Chapter 12
In addition, each chapter contains a set of empirical and simulation exercises. The simulation questions are designed to provide
the reader with first-hand information on the behavior and performance of some
of the theoretical results. The empirical exercises are designed to obtain a good
understanding of the difficulties involved in the process of modeling and forecasting
nonlinear time series using real-world data.
The book includes an extensive list of references. The many historical references
should be of interest to those wishing to trace the early developments of nonlinear
time series analysis. Also, the list contains references to more recent papers and
books in the hope that it will help the reader find a way through the burgeoning literature on the subject.
Reading roadmaps
I do not anticipate that the book will be read cover to cover. Instead, I hope that
the extensive indexing, ample cross-referencing, and worked examples will make it
possible for readers to directly find and then implement what they need. Nevertheless, for those who wish to obtain an overall impression of the book, I suggest reading Chapters 1 and 2, Sections 5.1 – 5.5, Sections 6.1 – 6.2, Sections 7.2 – 7.3, and
Chapters 9 and 10. Chapter 3 is more advanced, and can be omitted on a first read-
ing. Similarly, Chapter 8 can be read at a later stage because it is not an essential
part of the main text. In fact this chapter is somewhat peripheral.
Readers who wish to use the book to find out how to obtain forecasts of a data generating process that may be “expected” to have nonlinear features may find the following reading suggestions useful.
• Start with Chapter 1 to get a good understanding of the central concepts
such as linearity, Gaussianity, and stationarity. For instance, by exploring
a recurrence plot (Section 1.3.4) one may detect particular deviations from
the assumption of strict stationarity. This information, added to the many
stationarity tests available in the literature, may provide a starting point for
selecting and understanding different nonlinear (forecasting) models.
• To further support the above objectives, Sections 2.1 – 2.10 are worth reading
next. It is also recommended to read Section 6.1 on model estimation.
• Section 3.5 introduces the concept of invertibility, which is directly linked to the concept of forecastability. So this section should be a part of the reading list.
• Continue by reading Section 5.1 on Lagrange multiplier type tests. These tests are relatively easy to carry out in practice, provided the type of nonlinearity is known in advance. The diagnostic tests of Section 5.4, and the tests of Section 5.5, may provide additional information about potential model inadequacies.
• Next, continue reading Section 6.2.2 on model selection criteria.
• Finally, reading all or parts of the material in Chapter 10 is a prerequisite for
model-based forecasting and forecast evaluation. Alternatively, readers with
an interest in semi- and nonparametric models/methods may want to consult
(parts of) Chapter 12.
Acknowledgments
The first step in writing a book on nonlinear time series analysis dates back to the
year 1999. Given the growing interest in the field, both Bonnie K. Ray and I felt
that there was a need for a book of this nature. However, our joint efforts on the
book ended at an early stage because of a change of job (BKR) and various working
commitments (JDG). Hence, it is appropriate to begin the acknowledgement section
by thanking Bonnie for writing parts of a former version of the text. I also thank
her for valuable feedback, comments and suggestions on earlier drafts of chapters.
Many of the topics described in the book are outgrowths of co-authored research
papers and publications. These collaborations have greatly added to the depth and
breadth of the book. In particular, I would like to acknowledge Kurt Brännäs, Paul
De Bruin, Ali Gannoun, Kuldeep Kumar, Eric Matzner–Løber, Martin Knotters,
Selliah Sivarajasingham, Antoni Vidiella–i–Anguera, Ao Yuan, and Dawit Zerom.
In addition, I am very grateful to Roberto Baragona, Cees Diks, and Mike Clements, who read selected parts of the manuscript and offered helpful suggestions for improvement. Thanks also go to the many individuals who have been willing to
share their computer code and/or data with me. They are: Tess Astatkie, Luca Bag-
nato, Francesco Battaglia, Brendan Beare, Arthur Berg, Yuzhi Cai, Kung-Sik Chan,
Yi-Ting Chen, Daren Cline, Kilani Ghoudi, Jane L. Harvill, Yongmiao Hong, Rob Hyndman, Nusrat Jahan, Leena Kalliovirta, Dao Li, Dong Li, Guodong Li, Jing Li,
Shiqing Ling, Sebastiano Manzan, Marcelo Medeiros, Marcella Niglio, Tohru Ozaki,
Li Pan, Dimitris N. Politis, Nikolay Robinzonov, Elena Rusticelli, Hans J. Skaug,
Chan Wai Sum, Gyorgy Terdik, Howell Tong, Ruey S. Tsay, David Ubilava, Yingcun
Xia, and Peter C. Young (with apologies to anyone unintentionally left out). Finally,
I would like to thank all the publishers for permission to use materials from papers
that have appeared in their journals.
Contents

Preface vii
3 PROBABILISTIC PROPERTIES 87
3.1 Strict Stationarity 88
3.2 Second-order Stationarity 90
3.3 Application: Nonlinear AR–GARCH model 91
3.4 Dependence and Geometric Ergodicity 95
3.4.1 Mixing coefficients 95
3.4.2 Geometric ergodicity 96
3.5 Invertibility 101
3.5.1 Global 101
3.5.2 Local 107
3.6 Summary, Terms and Concepts 110
3.7 Additional Bibliographical Notes 110
3.8 Data and Software References 111
Appendix 112
3.A Vector and Matrix Norms 112
3.B Spectral Radius of a Matrix 114
Exercises 115
8 TIME-REVERSIBILITY 315
8.1 Preliminaries 316
8.2 Time-Domain Tests 317
8.2.1 A bicovariance-based test 317
8.2.2 A test based on the characteristic function 319
8.3 Frequency-Domain Tests 322
8.3.1 A bispectrum-based test 322
8.3.2 A trispectrum-based test 323
10 FORECASTING 391
10.1 Exact Least Squares Forecasting Methods 392
10.1.1 Nonlinear AR model 392
10.1.2 Self-exciting threshold ARMA model 394
10.2 Approximate Forecasting Methods 398
10.2.1 Monte Carlo 398
10.2.2 Bootstrap 399
10.2.3 Deterministic, naive, or skeleton 399
10.2.4 Empirical least squares 400
10.2.5 Normal forecasting error 401
10.2.6 Linearization 404
10.2.7 Dynamic estimation 406
10.3 Forecast Intervals and Regions 408
10.3.1 Preliminaries 408
References 529
1 INTRODUCTION AND SOME BASIC CONCEPTS

Informally, a time series is a record of a fluctuating quantity observed over time that
has resulted from some underlying phenomenon. The set of times at which observa-
tions are measured can be equally spaced. In that case, the resulting series is called
discrete. Continuous time series, on the other hand, are obtained when observations
are taken continuously over a fixed time interval. The statistical analysis can take
many forms: for instance, modeling the dynamic relationship of a time series, obtaining its characteristic features, forecasting future occurrences, and testing hypotheses about marginal statistics. Our concern is with time series that occur in discrete time and
are realizations of a stochastic/random process.
The foundations of classical time series analysis, as collected in books such as Box et al. (2008), Priestley (1981), and Brockwell and Davis (1991), to name just a few, are based on two underlying assumptions, stating that:
• The time series process is stationary, i.e. its statistical properties do not change over time.
• The time series process is an output from a linear filter whose input is a purely random process, known as white noise (WN), usually following a Gaussian, or normal, distribution. A typical example of a stationary linear Gaussian process is the well-known class of autoregressive moving average (ARMA) processes.
Although these twin assumptions are reasonable, there remains the rather prob-
lematic fact that in reality many time series are neither stationary, nor can be
described by linear processes. Indeed, there are many more occasions when subject-
matter, theory or data suggests that a stationarity-transformed time series is gen-
erated by a nonlinear process. In addition, a large fraction of time series cannot
be easily transformed to a stationary process. Examples of nonstationary and/or
nonlinear time series abound in the fields of radio engineering, marine engineering,
i.e., $\{\varepsilon_t\}$ is a sequence of independent and identically distributed (i.i.d.) random variables with mean zero and finite variance $\sigma_\varepsilon^2$. Such a sequence is also referred to as strict white
noise as opposed to weak white noise, which is a stationary sequence of uncorrelated
random variables. Obviously the requirement that {εt } is i.i.d. is more restrictive
than that this sequence is serially uncorrelated. Independence implies that third and higher-order non-contemporaneous moments of $\{\varepsilon_t\}$ are zero, i.e. $\mathbb{E}(\varepsilon_t \varepsilon_{t-i} \varepsilon_{t-j}) = 0$ $\forall i, j \neq 0$, and similarly for fourth and higher-order moments. When $\{\varepsilon_t\}$ is assumed
to be Gaussian distributed, the two concepts of white noise coincide.
More generally, the above concepts of white noise are in increasing degree of
“whiteness” part of the following classification system:
This infinite moving average (MA) representation should not be confused with the
Wold decomposition theorem for purely nondeterministic time series processes. In
(1.2) the process {εt } is only assumed to be i.i.d. and not weakly WN as in the
Wold representation. The linear representation (1.2) can also be derived under
the assumption that the spectral density function of {Yt , t ∈ Z} is positive almost
everywhere, except in the Gaussian case when all spectra of order higher than two
are identically zero; see Chapter 4 for details. Note that a slightly weaker form of
(1.2) follows by assuming that the process {εt } fulfills the conditions in (iii).
Time series processes such as (1.2) have the convenient mathematical property
that the best H-step ahead (H ≥ 1) mean squared predictor, or forecast, of Yt+H ,
denoted by E(Yt+H |Ys , −∞ < s ≤ t), is identical to the best linear predictor; see,
e.g., Brockwell and Davis (1991, Chapter 5). This result has been the basis of an
alternative definition of linearity. Specifically, a time series is said to be essentially linear if, for a given infinite past set of observations, the linear least squares predictor
is also the least squares predictor. In Chapter 4, we will return to this definition of
linearity.
Now suppose that {εt } ∼ WN(0, σε2 ) in (1.2). In that case the best mean square
predictor may not coincide with the best linear predictor. Moreover, under this
assumption, the complete probabilistic structure of {εt } is not specified: thus, nor
is the full probabilistic structure of {Yt }. Also, by virtue of {εt } being uncorrelated,
there is still information left in it. A partial remedy is to impose the assumption
that {Yt , t ∈ Z} is a Gaussian process, which implies that the process {εt } is also
Gaussian. Hence, (1.2) becomes
$$Y_t = \varepsilon_t + \sum_{i=1}^{\infty} \psi_i \varepsilon_{t-i}, \quad \text{where } \sum_{i=1}^{\infty} \psi_i^2 < \infty,\ \{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2). \tag{1.3}$$
Figure 1.1: Quarterly U.S. unemployment rate (in %) (252 observations); red triangle up
= business cycle peak, red triangle down = business cycle trough.
Then, the best mean square predictor of {Yt , t ∈ Z} equals the best linear predictor.
So, in summary, we classify a process {Yt , t ∈ Z} as nonlinear if neither (1.1) nor
(1.2) hold.
Finally, we mention that it is common to label a combined stochastic process,
such as (1.1) or (1.2), as the data generating process (DGP). A model should be
distinguished from a DGP. A DGP is a complete characterization of the statistical
properties of {Yt , t ∈ Z}. On the other hand, a model aims to provide a concise and
reasonably accurate reflection of the DGP.
Figure 1.2: (a) EEG recordings in voltage (μV ) for a data segment of 631 observations
(just over 3 seconds of signal), and (b) the reversed data plot.
falls in the range 1 – 20 Hz. Activity below or above this range is likely to be
an artifact of non-cerebral origin under standard normal recording techniques.
The spike and wave activity is clearly visible with periodic spikes separated
by slow waves. Note that there are differences in the rate at which the EEG
series rises to a maximum, and the rate at which it falls away from it. This is
an indication that the DGP underlying the series is not time-reversible.
A strictly stationary process {Yt , t ∈ Z} is said to be time-reversible if its
probability structure is invariant with respect to the reversal of time indices;
see Chapter 8 for a more formal definition. If such invariance does not hold, the
process is said to be time-irreversible. All stationary Gaussian processes are
time-reversible. The lack of time-reversibility is either an indication to consider
a linear stationary process with non-Gaussian (non-normal) innovations or a
nonlinear process. No point transformation, like the Box–Cox method, can
transform a time-irreversible process into a Gaussian process because such a
transformation only involves the marginal distribution of the series and ignores
dependence.
One simple way to detect departures from time-reversibility is to plot the time
series with the time axis reversed. Figure 1.2(b) provides an example. Clearly,
the mirror image of the series is not similar to the original plot. Thus, there is
evidence against reversibility. In general, looking at a reverse time series plot
can reinforce the visual detection of seasonal patterns, trends, and changes in
mean and variance that might not be obvious from the original time plot.
Figure 1.3: Magnetic field data set, T component (in nT units) in RTN coordinate system.
Time period: February 17, 1992 – June 30, 1997 (1,962 observations).
We see relatively large interplanetary shock waves at the beginning of the series
followed by a relatively stable period. Then, a considerable increase in wave
activity occurs on and around January 11, 1995. In general there is a great
variability in the strength of the magnetic field at irregular time intervals. No
linear model can account for these effects in the data.
Figure 1.4: (a) Plot of the Niño 3.4 index for the time period January 1950 – March 2012
(748 observations); (b) 5-month running average of the Niño 3.4 index with El Niño events
(red triangle up) and La Niña events (green triangle down).
model that allows for a smooth transition from an El Niño to a La Niña event,
and vice versa.
Figure 1.5: Cave plot of the δ 13 C (top, axis on the right) and δ 18 O (bottom, axis on the
left) time series. Time interval covers 896 – 2 ka (1 ka = 1,000 years); T = 216.
Figure 1.5 shows two plots of the univariate time series δ 13 C (denoted by
{Y1,t }) and δ 18 O (denoted by {Y2,t }), both of length T = 216, for the late
Pleistocene ice ages.2 The graph is called a cave plot since the visual distance
between the two curves resembles the inside of a cave. The cave plot is con-
structed so that if the dependence of {Y1,t } on {Y2,t } is linear and constant
over time then the visual distance between the curves is constant. In the
present case, this is accomplished by a linear regression of the series {Y2,t } on
{Y1,t } and obtaining the “transformed” series {Y1,t } as the fitted values.3
From the plot we see that the difference between the curves is not constant
during this particular climatic period. This feature makes the data suitable
for nonlinear modeling. In addition, we notice a clear correlation between
series, with values of δ 13 C increasing when δ 18 O decreases, and vice versa.
This suggests some nonlinear causality between the two series. In general,
these graphs can give a useful visual indication of joint (non)linear short- and
long-term periodic fluctuations, even if the two series are observed at irregular
times as in the present case.
Several visualization techniques have been proposed for this purpose. Here, we discuss a small subset of methods which we recommend for addition to the reader's basic toolkit.
For a symmetric distribution μ3,X = 0, and thus τX will be zero. The kurtosis for
the normal distribution is equal to 3. When κX > 3, the distribution of X is said
to have fat tails.
Let $\{X_i\}_{i=1}^n$ denote an i.i.d. random sample of $X$ of size $n$. Then $\mu_{r,X}$ can be consistently estimated by the sample moments $\widehat{\mu}_{r,X} = n^{-1}\sum_{i=1}^{n}(X_i - \overline{X})^r$, where $\overline{X} = n^{-1}\sum_{i=1}^{n} X_i$. Sample analogues of $\tau_X$ and $\kappa_X$ are given by

$$\widehat{\tau}_X = \frac{1}{n\,\widehat{\sigma}_X^3}\sum_{i=1}^{n}(X_i - \overline{X})^3, \qquad \widehat{\kappa}_X = \frac{1}{n\,\widehat{\sigma}_X^4}\sum_{i=1}^{n}(X_i - \overline{X})^4, \tag{1.4}$$

where

$$\widehat{\sigma}_X^2 \equiv \widehat{\mu}_{2,X} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \overline{X})^2.$$
If $\{X_i\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_X^2)$ then, as $n \to \infty$,

$$\sqrt{n}\begin{pmatrix} \widehat{\tau}_X \\ \widehat{\kappa}_X \end{pmatrix} \xrightarrow{D} \mathcal{N}\Big(\begin{pmatrix} 0 \\ 3 \end{pmatrix}, \begin{pmatrix} 6 & 0 \\ 0 & 24 \end{pmatrix}\Big). \tag{1.5}$$
Using this asymptotic property, we can perform a Student t-test for testing the null
hypothesis H0 : τX = 0, or testing H0 : κX − 3 = 0, separately. A joint test of the
null hypothesis H0 : τX = 0 and κX − 3 = 0, is often used as a test statistic for
normality. This leads to the so-called JB (Jarque and Bera, 1987) test statistic, i.e.,
$$\text{JB} = n\Big(\frac{\widehat{\tau}_X^2}{6} + \frac{(\widehat{\kappa}_X - 3)^2}{24}\Big), \tag{1.6}$$

which has an asymptotic $\chi_2^2$ distribution under $H_0$, as $n \to \infty$.
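A minimal sketch of (1.4) and (1.6), assuming an i.i.d. sample; the two test series are illustrative choices, not data from the book.

```python
# Jarque-Bera test statistic (1.6) with its asymptotic chi^2_2 p-value.
import numpy as np
from scipy import stats

def jarque_bera(x):
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    s2 = np.mean(d**2)                    # sample variance (1/n convention)
    tau = np.mean(d**3) / s2**1.5         # sample skewness, as in (1.4)
    kappa = np.mean(d**4) / s2**2         # sample kurtosis, as in (1.4)
    jb = n * (tau**2 / 6.0 + (kappa - 3.0)**2 / 24.0)   # (1.6)
    return jb, stats.chi2.sf(jb, df=2)    # asymptotic null distribution

rng = np.random.default_rng(0)
print(jarque_bera(rng.standard_normal(1000)))        # large p-value expected
print(jarque_bera(rng.standard_t(df=4, size=1000)))  # fat tails: small p-value
```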
$(\overline{X}^{-i}, S_X^{-i})$, where

$$\overline{X}^{-i} = \frac{1}{n-1}\sum_{j \neq i} X_j, \qquad S_X^{-i} = \Big(\frac{1}{n-2}\sum_{j \neq i}(X_j - \overline{X}^{-i})^2\Big)^{1/2}, \quad (i = 1, \ldots, n).$$

Next, apply the approximately normalizing cube-root transformation $Y_i = (S_X^{-i})^{2/3}$, and compute the sample correlation coefficient

$$r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \overline{X})^2 \sum_{i=1}^{n}(Y_i - \overline{Y})^2}}.$$

Then, exploiting the variance-stabilizing and skewness-reducing character of the Fisher z-transform, obtain the test statistic

$$Z_2 = \frac{1}{2}\log\Big(\frac{1 + r_{XY}}{1 - r_{XY}}\Big). \tag{1.7}$$

If the series $\{X_i\}_{i=1}^n$ consists of i.i.d. normal variables, then it can be shown (Lin and Mudholkar, 1980) that $Z_2$ is asymptotically normally distributed with mean 0 and variance $3/n$.
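A minimal sketch of the Lin–Mudholkar $Z_2$ statistic (1.7), following the leave-one-out construction above; the vectorized leave-one-out identities are a computational convenience, not part of the original formulation.

```python
# Lin-Mudholkar Z2 normality test statistic (1.7).
import numpy as np

def lin_mudholkar(x):
    x = np.asarray(x, dtype=float)
    n = x.size
    s, ss = x.sum(), (x**2).sum()
    mean_i = (s - x) / (n - 1)                      # leave-one-out means
    # sum_{j != i}(x_j - mean_i)^2 = ss - x_i^2 - (n-1)*mean_i^2
    var_i = (ss - x**2 - (n - 1) * mean_i**2) / (n - 2)
    y = var_i ** (1.0 / 3.0)                        # (S^{-i})^{2/3}: cube root of variance
    r = np.corrcoef(x, y)[0, 1]                     # sample correlation r_XY
    z2 = 0.5 * np.log((1 + r) / (1 - r))            # Fisher z-transform
    return z2, np.sqrt(3.0 / n)                     # statistic, asymptotic std. dev.

rng = np.random.default_rng(0)
z2, sd = lin_mudholkar(rng.standard_normal(500))
print(z2 / sd)      # approximately N(0,1) under normality
```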
Within a time series framework, the JB and Z2 test statistics are typically applied
to the residuals, usually written simply as εt , of a fitted univariate (non)linear time
series model as a final diagnostic step in the modeling process. A drawback of the JB
test is that the finite-sample tail quantiles are quite different from their asymptotic
counterparts. Alternatively, p-values of the JB test can be determined by means
of bootstrapping (BS) or Monte Carlo (MC) simulation. A better-behaved JB test statistic can be obtained using exact means and variances instead of the asymptotic mean and variance of the standardized third and fourth moments (cf. Exercise 1.5). Nevertheless, the JB and Z2 tests only rely on departures from symmetry of possible alternatives to the normal distribution. However, the question whether, for instance, a positive skewness in the original series is reproduced by the fitted nonlinear model cannot be answered by analyzing the residuals alone.
Example 1.6: Summary Statistics
Table 1.1 reports summary statistics for the series introduced in Section 1.2.
Except for the U.S. unemployment rate, for which we take the first differences,
we consider the original data. Note from the last column that the sample
kurtosis of the U.S. unemployment rate and the magnetic field data are much
Table 1.1: Summary statistics for the time series introduced in Section 1.2.
larger than the kurtosis for a normal distribution, indicating that both series
have heavy tails. Further, the sample skewness of the series indicates no
evidence of asymmetry. Below we search for more evidence to support these
observations, using a skewness-kurtosis test statistic that is able to account
for serial correlation.
where

$$F_{r,Y} = \sum_{\ell=-\infty}^{\infty} \gamma_Y(\ell)^r, \quad (r = 3, 4).$$

A consistent estimator of $F_{r,Y}$ is given by $\widehat{F}_{r,Y} = \sum_{|\ell| < T} \widehat{\gamma}_Y(\ell)^r$, and hence a generalized JB (GJB) statistic for testing normality in weakly dependent data is given by

$$\text{GJB} = \frac{T\,\widehat{\mu}_{3,Y}^2}{6\,\widehat{F}_{3,Y}} + \frac{T\big(\widehat{\mu}_{4,Y} - 3\,\widehat{\mu}_{2,Y}^2\big)^2}{24\,\widehat{F}_{4,Y}}, \tag{1.9}$$
which has an asymptotic χ22 distribution under the null hypothesis (Lobato and
Velasco, 2004). Moreover, the test statistic is consistent under the alternative hy-
pothesis.
Comparing (1.6) and (1.9), we see that asymptotically the GJB test statistic reduces to the JB test statistic if the DGP is i.i.d., since $\gamma_Y(\ell) \to 0$, $\forall \ell \neq 0$, and $\widehat{\gamma}_Y(0) = \widehat{\mu}_{2,Y} \neq 0$. Also observe that with positive serial correlation in the first few lags, the denominator in (1.9) will be larger than in JB. Consequently, the chance of rejecting normality will decrease when using the GJB test statistic.
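A minimal sketch of the GJB statistic (1.9), with $\widehat{F}_{r,Y}$ estimated by truncated sums of sample autocovariances; this is a simplification of the kernel (Parzen lag window) estimates used for Table 1.2, and the AR(1) test series is an illustrative choice.

```python
# Generalized JB (GJB) statistic (1.9) for weakly dependent data.
import numpy as np
from scipy import stats

def gjb(y, max_lag=None):
    y = np.asarray(y, dtype=float)
    T = y.size
    d = y - y.mean()
    mu2, mu3, mu4 = (np.mean(d**k) for k in (2, 3, 4))
    L = max_lag if max_lag is not None else int(np.sqrt(T))
    # Sample autocovariances gamma_Y(l), using gamma(-l) = gamma(l).
    gam = np.array([np.sum(d[:T-l] * d[l:]) / T for l in range(L + 1)])
    f3 = gam[0]**3 + 2 * np.sum(gam[1:]**3)   # estimate of F_{3,Y}
    f4 = gam[0]**4 + 2 * np.sum(gam[1:]**4)   # estimate of F_{4,Y}
    stat = T * mu3**2 / (6 * f3) + T * (mu4 - 3 * mu2**2)**2 / (24 * f4)
    return stat, stats.chi2.sf(stat, df=2)

rng = np.random.default_rng(0)
e = rng.standard_normal(2000)
y = np.zeros_like(e)
for t in range(1, e.size):        # Gaussian AR(1): dependent but normal
    y[t] = 0.5 * y[t-1] + e[t]
print(gjb(y))
```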
where $\alpha = (1, -3\sigma_Y^2)'$ is a $2 \times 1$ vector, and $\Gamma_{22}$ is the first $2 \times 2$ block matrix of $\Gamma = \lim_{T\to\infty} T\,\mathbb{E}(\overline{Z}\,\overline{Z}')$ with $\overline{Z}$ the sample mean of $\{Z_t\}$.

In applications, $\alpha$ can be consistently estimated by its sample counterpart $\widehat{\alpha} = (1, -3\widehat{\sigma}_Y^2)'$. A consistent and robust estimate, say $\widehat{\Gamma}_{22}$, of the long-run covariance matrix $\Gamma_{22}$ can be obtained by kernel-based estimation. Let $s(\widehat{\tau}_Y) = (\widehat{\alpha}'\widehat{\Gamma}_{22}\widehat{\alpha}/\widehat{\sigma}_Y^6)^{1/2}$. Then, under the null hypothesis $\tau_Y = 0$, the limiting distribution of the estimated coefficient of skewness is given by

$$\widehat{\pi}_{3,Y} = \frac{\sqrt{T}\,\widehat{\tau}_Y}{s(\widehat{\tau}_Y)} \xrightarrow{D} \mathcal{N}(0, 1), \tag{1.10}$$

where it is assumed that $\mathbb{E}(Y_t^6) < \infty$.
Also, Bai and Ng (2005) develop a statistic for testing kurtosis. Similar to the i.i.d. case, the coefficient of kurtosis and its sample analogue are defined as

$$\kappa_Y = \mu_{4,Y}/\mu_{2,Y}^2, \qquad \widehat{\kappa}_Y = \widehat{\mu}_{4,Y}/\widehat{\mu}_{2,Y}^2.$$

Suppose that $\mathbb{E}(Y_t^8) < \infty$. Let $W_t = \big((Y_t - \mu_Y)^4 - \mu_{4,Y},\ (Y_t - \mu_Y),\ (Y_t - \mu_Y)^2 - \sigma_Y^2\big)'$ be a $3 \times 1$ vector. Then, under the null hypothesis $\kappa_Y = 3$, and as $T \to \infty$, it can be shown that

$$\sqrt{T}(\widehat{\kappa}_Y - 3) \xrightarrow{D} \mathcal{N}\Big(0, \frac{\beta'\Omega\beta}{\sigma_Y^8}\Big),$$

where $\beta = (1, -4\mu_{3,Y}, -6\sigma_Y^2)'$ is a $3 \times 1$ vector, and $\Omega = \lim_{T\to\infty} T\,\mathbb{E}(\overline{W}\,\overline{W}')$ with $\overline{W}$ the sample mean of $\{W_t\}$.
$$\widehat{\pi}_{34,Y} = \widehat{\pi}_{3,Y}^2 + \widehat{\pi}_{4,Y}^2, \tag{1.12}$$
Table 1.2: Test statistics for serially correlated data. The long-run covariance matrices of the test statistics $\widehat{\pi}_{3,Y}$, $\widehat{\pi}_{4,Y}$, and $\widehat{\pi}_{34,Y}$ are estimated by the kernel method with Parzen's lag window; see (4.18).
is given by $\pm 1.96/\sqrt{T}$. However, using Bartlett's formula can lead to spurious results (Berlinet and Francq, 1997) as it is derived under the precise assumptions of linearity of the underlying DGP and vanishing of its fourth-order cumulants (cf. Exercise 1.3).
where $\text{sign}(u) = 1\ (-1, 0)$ if and only if $u > (<, =)\ 0$. Then Kendall's $\tau$ test statistic is defined as

$$\widehat{\tau} = \binom{n}{2}^{-1}\sum_{i<j} h(i, j) = \frac{N_c - N_d}{\frac{1}{2}n(n-1)}. \tag{1.13}$$

Here $N_c$ ($c$ for concordant) is the number of pairs for which $h(i, j)$ is positive, and $N_d$ ($d$ for discordant) is the number of pairs for which $h(i, j)$ is negative.
It is immediately verifiable that (1.13) always lies in the range $-1 \leq \widehat{\tau} \leq 1$, where values 1, −1, and 0 signify a perfect positive relationship, a perfect negative relationship, and no relationship at all, respectively. The null hypothesis, $H_0$, is that the random variables X and Y are independent, while the alternative hypothesis, $H_1$, is that they are not independent. For large samples, the asymptotic null distribution of $\widehat{\tau}$ is normal with mean zero and variance $2(2n+5)/\{9n(n-1)\} \approx 4/(9n)$. Note that one of the properties of $\widehat{\tau}$ is that each variable of the pair $(X_i, Y_i)$ can be replaced by its associated ranks. The resulting test statistic is commonly known as the Mann–Kendall test statistic, which has been used as a nonparametric test for trend detection and seasonality within the context of linear time series analysis.
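A minimal sketch of (1.13) via a direct $O(n^2)$ count of concordant and discordant pairs; the bivariate test data are an illustrative choice.

```python
# Kendall's tau (1.13) and its standardized value under independence.
import numpy as np

def kendall_tau(x, y):
    x, y = np.asarray(x), np.asarray(y)
    n = x.size
    nc = nd = 0
    for i in range(n - 1):
        s = np.sign(x[i+1:] - x[i]) * np.sign(y[i+1:] - y[i])
        nc += np.sum(s > 0)      # concordant pairs
        nd += np.sum(s < 0)      # discordant pairs
    tau = (nc - nd) / (0.5 * n * (n - 1))
    var = 2 * (2 * n + 5) / (9 * n * (n - 1))   # asymptotic null variance
    return tau, tau / np.sqrt(var)

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
print(kendall_tau(x, 0.5 * x + rng.standard_normal(200)))  # dependent pair
```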
To obtain a version of Kendall's $\tau$ test statistic suitable for testing against serial dependence in a time series $\{Y_t\}_{t=1}^T$, simply replace $\{(X_i, Y_i)\}_{i=1}^n$ by $\{(R_i, R_{i+\ell})\}_{i=1}^{T-\ell}$, where $\{R_i\}$ are the ranks of $\{Y_t\}$. Then Kendall's $\tau$ test statistic may be defined as

$$\widehat{\tau}(\ell) = 1 - \frac{2N_d(\ell)}{\binom{T-\ell}{2}} = 1 - \frac{4N_d(\ell)}{(T-\ell)(T-\ell-1)}, \tag{1.14}$$

with

$$N_d(\ell) = \sum_{i=1}^{T-\ell}\sum_{j=1}^{T-\ell} I\big(R_i < R_j,\ R_{i+\ell} > R_{j+\ell}\big).$$
Using the theory of U-statistics for weakly dependent stationary processes (see Appendix 7.C), it can be shown (Ferguson et al., 2000) that under the null hypothesis of serial independence $\sqrt{T}\,\widehat{\tau}(1)$ is asymptotically distributed as a normal random variable with mean zero and variance 4/9 for $T \geq 4$. For $\ell > 1$, explicit expressions for $\text{Var}\big(\widehat{\tau}(\ell)\big)$ are rather cumbersome to obtain. However, under the null hypothesis of randomness, any $K$-tuple of the form $3\sqrt{T}\big(\widehat{\tau}(1), \ldots, \widehat{\tau}(K)\big)'/2$ is asymptotically multinormal, with mean vector zero and unit covariance matrix.
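A minimal sketch of the serial version (1.14), computed directly from the rank pairs $(R_i, R_{i+\ell})$; the MA(1) input series is a hypothetical example.

```python
# Kendall's tau(l) of (1.14) for serial dependence in a time series.
import numpy as np

def tau_serial(y, lag):
    y = np.asarray(y)
    r = np.argsort(np.argsort(y)) + 1        # ranks R_1, ..., R_T
    a, b = r[:-lag], r[lag:]                 # pairs (R_i, R_{i+l})
    m = a.size                               # m = T - l
    nd = 0
    for i in range(m - 1):                   # count discordant pairs N_d(l)
        s = np.sign(a[i+1:] - a[i]) * np.sign(b[i+1:] - b[i])
        nd += np.sum(s < 0)
    return 1 - 4 * nd / (m * (m - 1))

rng = np.random.default_rng(0)
e = rng.standard_normal(300)
y = np.convolve(e, [1.0, 0.8], mode="valid")   # MA(1): dependence at lag 1
print(tau_serial(y, 1), tau_serial(y, 5))
# Under serial independence, sqrt(T)*tau(1) is approximately N(0, 4/9).
```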
Table 1.3: Indicator patterns of the sample ACF and values of Kendall's $\widehat{\tau}$ test statistic.

Series                            Lag:  1   2   3   4   5   6   7   8   9   10
U.S. unemployment rate  ACF (1):       +∗  +∗  −   −∗  −∗  −∗  −   −∗  −   −
                        τ̂(ℓ) (2):      +•  +•  +   −   −•  −•  −   −•  −   −
δ18O                    ACF:           +∗  +∗  +∗  +   +   −   −   −∗  −∗  −∗
                        τ̂(ℓ):          +•  +•  +•  +•  +   −   −   −•  −•  −•

(1) +∗ indicates a sample ACF value greater than 1.96T^{−1/2}, −∗ indicates a value less than −1.96T^{−1/2}, and + (−) indicates a positive (negative) value between −1.96T^{−1/2} and 1.96T^{−1/2}.
(2) • marks a p-value smaller than 5%, and + (−) marks a positive (negative) value of the test statistic with a p-value larger than 5%.
• For the U.S. unemployment series the sample ACF suggests, as a first
guess, a linear AR(8) model with significant parameter values at lags 1,
2, 4 – 6, and 8. The results for τ() match those of the sample ACF.
• The sample ACF of the EEG recordings suggests a linear AR(6) model.
On the other hand, Kendall’s τ() test statistics are all significant up
to and including lag = 10. So it is hard to describe the series by a
particular (non)linear model.
• Neither the sample ACF nor $\widehat{\tau}(\ell)$ is very helpful in identifying preliminary models for the magnetic field data and the monthly ENSO time
series. Clearly, the fact that normality is strongly rejected for the mag-
netic field data has an impact on the significance of the series’ test results.
The sample ACF of the ENSO series has a significant negative peak (5%
level) at lag 21 and a positive (insignificant) peak at lag 56. This reflects
the fact that ENSO periods lasted between two and five years in the last
century.
• The sample ACFs of the δ 13 C and δ 18 O series indicate that both series
can be represented by a low order AR process, but there are also some
significant values at lags 8 – 10. The test results for τ() match those of
the sample ACFs.
$$\widehat{\tau}_p(\ell) = 1 - \frac{4N_p(\ell)}{(T-\ell)(T-\ell-1)}. \tag{1.15}$$

Here $N_p(\ell)$ is the number of pairs $\{(R_i, R_{i+\ell})\}_{i=1}^{T-\ell}$ such that $\lVert Z_i - Z_j \rVert \leq T_Z$, for $T_Z$ a predefined “tolerance” (e.g. $T_Z = 0.2T$), with $Z_i = (R_{i+1}, \ldots, R_{i+\ell-1})'$ $(i = 1, \ldots, T-\ell)$, and $\lVert \cdot \rVert$ a norm. The statistic $\widehat{\tau}_p(\ell)$ has similar properties as $\widehat{\tau}(\ell)$. Moreover, it can be shown that $\widehat{\tau}_p(\ell)$ has an asymptotically normal distribution under the null hypothesis of no serial dependence.
which is just the mathematical expectation of $-\log f_X(x)$, i.e., $-\mathbb{E}\{\log f_X(x)\}$. Similarly, for a pair of random variables $(X, Y)$ with joint pdf $f_{XY}(x, y)$, the joint entropy is defined as

$$H(X, Y) = -\iint f_{XY}(x, y)\,\log f_{XY}(x, y)\,\mathrm{d}x\,\mathrm{d}y. \tag{1.17}$$
The corresponding sample estimate, say $\widehat{R}_Y(\ell)$, follows from estimating functionals of density functions. No distributional theory is currently available for $\widehat{R}_Y(\cdot)$, but empirical critical values may be computed for specific choices of $T$ and $\ell$; see, e.g., Granger and Lin (1994, Table III). Simulations show that $\widehat{R}_Y(\ell)$ has a positive bias. One way to avoid such a bias is to redefine (1.21) as $R_Y^*(\ell) = 1 - \exp\{-2\widehat{I}^{\text{KL}}(Y_t, Y_{t+\ell})\}$.
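As an illustration, a minimal sketch of the sample mutual information between a series and its lag, using a simple equal-width binning estimator (one of the approaches mentioned in the bibliographical notes of this chapter); the AR(1) test series is a hypothetical example.

```python
# Binning estimator of the mutual information I(Y_t, Y_{t+l}).
import numpy as np

def mutual_info_lag(y, lag, bins=10):
    x0, x1 = y[:-lag], y[lag:]
    pxy, _, _ = np.histogram2d(x0, x1, bins=bins)
    pxy /= pxy.sum()                       # joint relative frequencies
    px = pxy.sum(axis=1, keepdims=True)    # marginal of Y_t
    py = pxy.sum(axis=0, keepdims=True)    # marginal of Y_{t+l}
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

rng = np.random.default_rng(0)
y = np.zeros(2000)
e = rng.standard_normal(2000)
for t in range(1, 2000):                   # AR(1) with strong lag-1 dependence
    y[t] = 0.8 * y[t-1] + e[t]
mi = mutual_info_lag(y, 1)
print(mi, 1 - np.exp(-2 * mi))             # bias-adjusted scale R*_Y(l)
```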
and $\lVert \cdot \rVert$ is a norm.⁶
If {Yt , t ∈ Z} is strictly stationary, the recurrence plot will show an approximately
uniform density of recurrences as a function of the time difference t1 − t2 . However,
if $\{Y_t, t \in \mathbb{Z}\}$ has a trend or another type of nonstationarity, with a behavior that is changing over time, the regions of $Y_t^{(\ell)}$ visited will change over time. The result
will be that there are relatively few recurrences far from the main diagonal in the
recurrence plot, that is for large values of |t1 − t2 |. Also, if there are only recurrences
⁴ In the analysis of deterministic chaos, i.e. irregular oscillations that are not influenced by random inputs, m is often called the embedding dimension. Within that context, it is important to choose m sufficiently large, such that the so-called m-dimensional phase space allows for a “proper” representation of the dynamical system.
⁵ In economics and finance, but not in other fields, it is common to fix ℓ at one. So m takes over the role of ℓ. In that case we write $Y_t$, suppressing the dependence on ℓ.
⁶ In fact, the supremum norm is very popular for recurrence plots; see Appendix 3.A for more information on vector and matrix norms.
near t1 = t2 and for values of |t1 − t2 | that are of the order of the total length T ,
{Yt , t ∈ Z} can be considered nonstationary. Obviously, in alliance with the choice
of and m, visual interpretation of recurrence plots requires some experience.
$$Y_t = aY_{t-1}(1 - Y_{t-1}), \tag{1.22}$$

where $a > 1$ denotes the growth rate at time t of the species in the case of unlimited natural resources. The factor $(1 - Y_{t-1})$ describes the effect of over-population. In some cases, a particular solution of (1.22) can be found, depending on the value of a and the starting value $Y_0$.
Figure 1.7: (a) Directed scatter plot at lag 1 for the EEG recordings, and (b) a scatter plot
with the two largest and two smallest values connected with the preceding and the following
observations.
Figure 1.6, top panel, shows the first 200 observations of a time series {Yt }
generated with (1.22) for a = 4. The plot shows an erratic pattern, akin to
that of a realization from some stochastic process. Still, the evolution of {Yt }
is an example of chaos. The recurrence plot for $\{Y_t\}_{t=1}^{200}$ is shown in the bottom panel of Figure 1.6(b).
It is interesting to contrast the main features of graph (b) with the charac-
teristic features of graph (a), showing a recurrence plot of an i.i.d. U (0, 1)
distributed time series, and with the patterns in graph (c), showing a re-
currence plot of the time series Yt + 0.005t. Graph (a) has a homogeneous
typology or pattern, which is an indicator that the series originated from a
stationary DGP. In contrast, a non-homogeneous or disrupting typology, as
with the recurrence plot in graph (c), indicates a nonstationary DGP. Finally,
graph (b) shows a recurrence plot with a diagonal oriented periodic struc-
ture due to the oscillating patterns of {Yt }. This is supported by the plot in
the middle panel. The white areas or bands in the recurrence plots indicate
changes in the behavior of a time series, perhaps due to outliers or structural
shifts. As an exercise the reader is recommended to obtain recurrence plots
for higher values of the embedding dimension m, and see whether or not the
overall observations made above remain unchanged.
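A minimal sketch reproducing the flavor of Figure 1.6: simulate the logistic map (1.22) with a = 4 and compute a recurrence matrix for embedding dimension m and delay ℓ; the threshold eps is a tuning choice, and the supremum norm is used as suggested in footnote 6.

```python
# Logistic map data and a recurrence matrix for a recurrence plot.
import numpy as np

def logistic_map(a, y0, T):
    y = np.empty(T)
    y[0] = y0
    for t in range(1, T):
        y[t] = a * y[t-1] * (1.0 - y[t-1])   # (1.22)
    return y

def recurrence_matrix(y, m=1, lag=1, eps=0.1):
    # Delay-coordinate vectors of length m (up to an index shift).
    n = y.size - (m - 1) * lag
    emb = np.column_stack([y[i*lag : i*lag + n] for i in range(m)])
    # Pairwise supremum-norm distances between embedded points.
    dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
    return dist <= eps        # True marks a recurrence at (t1, t2)

y = logistic_map(a=4.0, y0=0.1, T=200)
R = recurrence_matrix(y, m=1, lag=1, eps=0.1)
print(R.shape, R.mean())      # fraction of recurrent pairs
# A nonstationary variant as in graph (c): y + 0.005 * np.arange(200).
```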
⁷ An obvious three-dimensional extension is to plot $(Y_t, Y_{t-\ell}, Y_{t-\ell'})$ $(\ell \neq \ell';\ \ell, \ell' = 1, 2, \ldots)$. For this purpose the function autotriples in the R-tsDyn package can be used. Alternatively, the function autotriples.rgl displays an interactive trivariate plot of $(Y_{t-1}, Y_{t-2})$ against $Y_t$.
versions of the JB-test in the i.i.d. case. Koizumi et al. (2009) derive some multivariate JB
tests. Fiorentini et al. (2004) show that the JB test can be applied to a broad class of
GARCH-M processes. Boutahar (2010) establishes the limiting distributions for the JB test
statistic for long memory processes. Kilian and Demiroglu (2000) find that the JB test statistic applied to the residuals of linear AR processes is too conservative, in the sense that it will rarely reject the null hypothesis of normality in the residuals. Using the same setup
as with the Lin–Mudholkar test statistic, Mudholkar et al. (2002) construct a test statistic
based on the correlation between the sample mean and the third central sample moment.
Section 1.3.2: Nielsen and Madsen (2001) propose generalizations of the sample ACF and
sample PACF for checking nonlinear lag dependence founded on the local polynomial regres-
sion method (Appendix 7.A). Some of the methodology discussed in that paper is implemen-
ted in the MATLAB and R source codes contained in the zip-file comp ex 1 scrips 2011.zip,
which can be downloaded from http://www2.imm.dtu.dk/courses/02427/.
If $\{Y_t\}_{t=1}^T$ follows a linear causal process, as defined by (1.2), but now the $\varepsilon_t$'s are i.i.d. with mean zero and infinite variance rather than i.i.d. with finite variance, then the sample ACF for heavy-tailed data, defined as $\widehat{\rho}_Y(\ell) = \sum_{t=1}^{T-\ell} Y_t Y_{t+\ell}\big/\sum_{t=1}^{T} Y_t^2$, still converges to a constant $\rho_Y(\ell) = \sum_{i=0}^{\infty}\psi_i\psi_{i+\ell}\big/\sum_{i=0}^{\infty}\psi_i^2$ $(\ell \in \mathbb{Z})$. However, for many nonlinear models $\widehat{\rho}_Y(\ell)$ converges to a nondegenerate random variable. Resnick and Van den Berg (2000a,b) use this fact to construct a test statistic for (non)linearity based on subsample stability of $\widehat{\rho}_Y(\ell)$; see the S-Plus code at the website of this book.⁸
Section 1.3.3: Several methods have been proposed for the estimation of the mutual in-
formation (Kullback–Leibler divergence) such as kernel density estimators, nearest neighbor
estimators and partitioning (or binning) the XY plane. This latter approach, albeit in a
time series context, is available through the function mutual in the R-tseriesChaos package.
Khan et al. (2007) compare the relative performance of four mutual information estimation
methods. Wu et al. (2009) discuss the estimation of mutual information in higher dimensions
and modest samples (500 ≤ T ≤ 1,000).
Software References
Section 1.2: Becker et al. (1994) introduce the cave plot for comparing multiple time
series. The plot in Figure 1.5 is produced with an S-Plus function written by Henrik Aalborg
Nielsen; see the website of this book. Alternatively, cave plots can be obtained using the R-
grid package. Note, McLeod et al. (2012) provide an excellent overview of many R packages
for plotting and analyzing, primarily linear, time series.
Section 1.3.1: The Jarque–Bera test statistic is a standard routine in many software
packages. The generalized JB test statistic can be easily obtained from a simple modification
of the code for the JB test. GAUSS⁹ code for the Bai–Ng tests for skewness, kurtosis, and normality is available at http://www.columbia.edu/~sn2294/research.html. A MATLAB¹⁰ function for computation of these test statistics can be downloaded from the website of this book.
Section 1.3.2: FORTRAN77 subroutines for calculating Kendall’s (partial) tau for uni-
variate and multivariate (vector) time series, created by Jane L. Harvill and Bonnie K. Ray,
are available at the website of this book.
Section 1.3.4: The results in Figures 1.6(a) – (c) can be reproduced with the function recurr
in the R-tseriesChaos package. Alternatively, one can analyze the data with the function
recurrencePlot in the R-fNonlinear package. The R-tsDyn package contains functions for
explorative data analysis (e.g. recurrence plots, and sample (P)ACFs), and nonlinear AR
estimation.
User-friendly programs for delay coordinate embedding, nonlinear noise reduction, mutual
information, false-nearest neighbor, maximal Lyapunov exponent, recurrence plot, determ-
inism test, and stationarity test can be downloaded from http://www.matjazperc.com/
ejp/time.html. Alternatively, http://staffhome.ecm.uwa.edu.au/~00027830/ contains
MATLAB functions to accompany the book by Small (2005). Another option for applying
nonlinear dynamic methods is the TISEAN package. The package is publicly available from
⁹ GAUSS is a registered trademark of Aptech Systems, Inc.
¹⁰ MATLAB is a registered trademark of MathWorks, Inc.
Exercises
Theory Questions
1.1 Let the ARCH(1) process $\{Y_t, t \in \mathbb{Z}\}$ be defined by $Y_t|(Y_{t-1}, Y_{t-2}, \ldots) = \sigma_t\varepsilon_t$, where $\sigma_t^2 = \alpha_0 + \alpha_1 Y_{t-1}^2$, and $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$.¹¹ Assume $\alpha_0 > 0$ and $0 < \alpha_1 < 1$. Rewrite $\{Y_t^2, t \in \mathbb{Z}\}$ in the form of an AR(1) process. Then show that the error process of the resulting model does not have a constant conditional variance, i.e. $\{Y_t^2, t \in \mathbb{Z}\}$ is not a weakly linear time series process.
1.2 Consider the process $Y_t = \beta Y_{t-2}\varepsilon_{t-1} + \varepsilon_t$, where $\{\varepsilon_t\}$ is an i.i.d. sequence such that $\mathbb{E}(\varepsilon_t) = \mathbb{E}(\varepsilon_t^3) = 0$, $\mathbb{E}(\varepsilon_t^2) = \sigma_\varepsilon^2$, and $\mathbb{E}(\varepsilon_t^4) < \infty$, and where $\beta$ is a real constant such that $\beta^4 < 1$. Let $\varepsilon_0 = 0$ and $Y_{-1} = Y_0 = 0$ be the starting conditions of the process.
Show that the ARCH process in Exercise 1.1 does not satisfy the white noise condition, i.e. $\lim_{T\to\infty}\gamma_Y^{-2}(0)\,\text{Var}\big(\sqrt{T}\,\widehat{\gamma}_Y(1)\big)$ increases monotonically from 1 to $\infty$, as $\alpha_1$ increases from 0 to $1/\sqrt{3}$.
(a) Show that $I^{\text{KL}}(X, Y)$ is non-negative, and 0 if and only if X and Y are independent.
(b) Suppose there exists a function h(·) such that X = h(Y). Show that $I^{\text{KL}}(X, Y) = \infty$.
1.5 Suppose $\{Y_i\}_{i=1}^n$ is a sequence of i.i.d. random variables of Y with mean zero. If the rth moment of Y exists, then the semi-invariants or cumulants $k_p$ are defined by the identity in t: $\exp\{\sum_{p=1}^{\infty} k_p (it)^p/p!\} = \phi(t)$, with $\phi(t)$ the characteristic function.
¹¹ Throughout the book, we assume that the reader is familiar with the class of so-called (generalized) autoregressive conditional heteroskedastic (abbreviated as (G)ARCH) models; see, e.g., the excellent, and up-to-date, book by Francq and Zakoïan (2010).
Figure 1.8: Climate change data set. (a) Recurrence plot of the δ 13 C time series, and (b)
recurrence plot of the δ18O time series. Embedding dimension m = 3, and ℓ = 1.
In normal samples it can be shown that $\overline{Y}$, $\widehat{\mu}_{2,Y}$, and $\widehat{\mu}_{\nu,Y}\,\widehat{\mu}_{2,Y}^{-\nu/2}$ $(\nu = 3, 4, \ldots)$ are independent, and hence that

$$\text{Var}\Big(\frac{k_3}{k_2^{3/2}}\Big) = \frac{6n(n-1)}{(n-2)(n+1)(n+3)}, \qquad \text{Var}\Big(\frac{k_4}{k_2^2}\Big) = \frac{24n(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}.$$

(a) Using the above results, show that the exact mean and variance of the sample coefficient of skewness $\widehat{\tau}_Y$ and the sample coefficient of kurtosis $\widehat{\kappa}_Y$ are, respectively, given by

$$\mathbb{E}(\widehat{\tau}_Y) = 0, \quad \text{Var}(\widehat{\tau}_Y) = \frac{6(n-2)}{(n+1)(n+3)},$$
$$\mathbb{E}(\widehat{\kappa}_Y) = \frac{3(n-1)}{n+1}, \quad \text{Var}(\widehat{\kappa}_Y) = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)}.$$

(b) Given the results in part (a), define an alternative for the JB test statistic (1.6).
(b) Given the results in part (a) define an alternative for the JB test statistic (1.6).
1.6 Figures 1.8(a) and (b) display the recurrence plots of the δ13C and δ18O time series, respectively; see Example 1.5. Provide a global characterization of each plot, in terms of homogeneity, periodicity, and trend or drift.
1.7 Figure 1.9 shows raw data plots of length T = 100, together with corresponding
directed scatter plots, for three simulated time series processes:
i) $Y_t = \varepsilon_t$, (Gaussian white noise),
ii) $Y_t = 0.6Y_{t-1}\varepsilon_{t-1} + \varepsilon_t$, (a stationary BL process; see Section 2.2),
iii) $Y_t = \sigma_t\varepsilon_t$, $\sigma_t^2 = 1 + 1.2Y_{t-1}^2$, (a nonstationary ARCH(1) process),

where in all cases $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. The graphs are listed in random order. Which
set of graphs corresponds to the listed processes?
Figure 1.9: Three time series plots and associated directed scatter plots.
2 CLASSIC NONLINEAR MODELS

In Section 1.1, we discussed in some detail the distinction between linear and nonlinear time series processes. In order to make this distinction as clear as possible,
we introduce in this chapter a number of classic parametric univariate nonlinear
models. By “classic” we mean that during the relatively brief history of nonlinear
time series analysis, these models have proved to be useful in handling many non-
linear phenomena in terms of both tractability and interpretability. The chapter
also includes some of their generalizations. However, we restrict attention to uni-
variate nonlinear models. By “univariate”, we mean that there is one output time
series and, if appropriate, a related unidirectional input (exogenous) time series. In
Chapter 11, we deal with vector (multivariate) parametric models in which there
are several jointly dependent time series variables. Nonparametric univariate and
multivariate methods will be the focus of Chapters 4, 9 and 12.
The chapter is organized as follows. In Section 2.1, we introduce a general non-
linear time series model followed by a representation as a so-called state-dependent
model (SDM). The SDM builds upon the basic structure of the linear ARMA model.
In particular, it generalizes the ARMA model to the nonlinear version by allowing
the coefficients to take on more complex, and hence, flexible forms. As we will
see in Sections 2.2 – 2.5, by imposing appropriate restrictions on the parameters of
the SDM several important classes of nonlinear models emerge. In Section 2.6, we
introduce the class of regime switching threshold models. Basically, these models
can be regarded as piecewise linear approximations to the general nonlinear time
series model of Section 2.1. Next, to allow for slow changes between various states
of the DGP, we discuss smooth transition models in Section 2.7. In Section 2.8,
we introduce some nonlinear non-Gaussian models. Section 2.9 deals with artificial
neural networks (ANNs) which are useful for DGPs that have an unknown functional
form. In Section 2.10, we focus on Markov switching models where the regimes are
determined by an unobservable process. In the final section, we illustrate a number
of practical issues of ANN modeling via a case study.
In addition, the chapter contains two appendices. In Appendix 2.A, we briefly in-
troduce the concept of (non)linear impulse response functions. We will see that these
response functions are a convenient tool for illustrating the dynamics of (non)linear
time series models. Appendix 2.B provides a list of abbreviations for threshold-type
nonlinear models which have been introduced in the literature since the early 1970s.
which is independent of future observations and due to its generality may be con-
sidered as a nonlinear model. Model (2.1) is also referred to as causal or non-
anticipative in the sense that future values, which typically are not available, do not
participate in the functional form of the model.
Now we face the problem of finding h(·) such that (2.1) is causally invertible, i.e. it can be “solved” for $Y_t$ as a function of $\{\ldots, \varepsilon_{t-2}, \varepsilon_{t-1}, \varepsilon_t\}$:

$$Y_t = \widetilde{h}(\varepsilon_t, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots). \tag{2.2}$$
In addition, while maintaining their generality, the functions h(·) and $\widetilde{h}$(·) must be tractable for the purpose of statistical analysis. However, as (2.2) stands, not much can be said or done as far as analysis of a given time series is concerned. Therefore, we assume that $\widetilde{h}$(·) is a sufficiently well-behaved function so that we can expand (2.2) in a Taylor series about some fixed time point – say $\mathbf{0} = (0, 0, \ldots)'$. Then we can write

$$Y_t = \mu + \sum_{u=0}^{\infty} g_u\,\varepsilon_{t-u} + \sum_{u,v=0}^{\infty} g_{uv}\,\varepsilon_{t-u}\varepsilon_{t-v} + \sum_{u,v,w=0}^{\infty} g_{uvw}\,\varepsilon_{t-u}\varepsilon_{t-v}\varepsilon_{t-w} + \cdots, \tag{2.3}$$

where

$$\mu = \widetilde{h}(\mathbf{0}), \quad g_{u_1} = \frac{\partial \widetilde{h}}{\partial \varepsilon_{t-u_1}}\Big|_{\mathbf{0}}, \quad \cdots, \quad g_{u_1,\ldots,u_n} = \frac{\partial^n \widetilde{h}}{\partial \varepsilon_{t-u_1}\cdots\partial \varepsilon_{t-u_n}}\Big|_{\mathbf{0}}.$$
The coefficients $\{g_u\}$, $\{g_{uv}\}$, $\{g_{uvw}\}, \ldots$ are known as the Volterra kernels.¹ The first two terms in (2.3) correspond to a linear causally invertible model.
One may also consider the dual Volterra series, which is obtained by a Taylor series expansion applied to (2.1) – assuming invertibility of $\widetilde{h}$(·) and smoothness of h(·) – to obtain

$$\varepsilon_t = \widetilde{\mu} + \sum_{u=0}^{\infty} \widetilde{g}_u\,Y_{t-u} + \sum_{u,v=0}^{\infty} \widetilde{g}_{uv}\,Y_{t-u}Y_{t-v} + \sum_{u,v,w=0}^{\infty} \widetilde{g}_{uvw}\,Y_{t-u}Y_{t-v}Y_{t-w} + \cdots. \tag{2.4}$$

Truncating both expansions at finite orders p and q yields the input–output relation

$$\widetilde{\mu} + \sum_{u=0}^{p} \widetilde{g}_u\,Y_{t-u} + \sum_{u,v=0}^{p} \widetilde{g}_{uv}\,Y_{t-u}Y_{t-v} + \sum_{u,v,w=0}^{p} \widetilde{g}_{uvw}\,Y_{t-u}Y_{t-v}Y_{t-w} + \cdots = \mu + \sum_{u=0}^{q} g_u\,\varepsilon_{t-u} + \sum_{u,v=0}^{q} g_{uv}\,\varepsilon_{t-u}\varepsilon_{t-v} + \sum_{u,v,w=0}^{q} g_{uvw}\,\varepsilon_{t-u}\varepsilon_{t-v}\varepsilon_{t-w} + \cdots, \tag{2.5}$$
Note that (2.7) treats $\{\varepsilon_t\}$ as an observable input; therefore, the input-output relationships are expressed in terms of a finite number of past inputs and outputs.²
When $\{\varepsilon_t\}$ is unobservable and instead is taken as a random variable, we may reduce the observed time series $\{Y_t\}$ to a strict WN series by redefining G(·) as

$$Y_t = \widetilde{G}(Y_{t-1}, \ldots, Y_{t-p}, \varepsilon_{t-1}, \ldots, \varepsilon_{t-q}) + \varepsilon_t. \tag{2.8}$$

With $\widetilde{G}$(·) so defined, $\{\varepsilon_t\}$ is considered as the innovation process for $\{Y_t\}$, while $\widetilde{G}$(·) defines the relevant information on $Y_t$ which is contained in past values of $\{Y_t\}$ and its innovation process $\{\varepsilon_t\}$. Observe that $\mathbb{E}(Y_t|\mathcal{F}^{t-1}) = \widetilde{G}(Y_{t-1}, \ldots, Y_{t-p}, \varepsilon_{t-1}, \ldots, \varepsilon_{t-q})$.
Clearly, the above formulation is not restricted to the case where {εt } is unobserv-
able. It can also be adopted to the case where {εt } is a controlled input variable
which may enter the model linearly as a factor influencing current output {Yt }.
¹ Named in honor of Vito Volterra, who studied integral equations involving kernels of this form in the first half of the 20th century.
² In neural network studies the Volterra expansion with finite sums is often called the Kolmogorov–Gabor polynomial, or alternatively the Ivakhnenko polynomial.
$$Y_t = \mu(S_{t-1}) + \sum_{i=1}^{p} \phi_i(S_{t-1})\,Y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j(S_{t-1})\,\varepsilon_{t-j}. \tag{2.10}$$
Model (2.10) has been introduced by Priestley (1980). It is called the state-
dependent model (SDM) of order (p, q) and may be regarded as a local linearization
of the general nonlinear model (2.9). The unknown parameters of the model are
φi (·) (i = 1, . . . , p), θj (·) (j = 1, . . . , q), the “local mean” μ(·), all of which depend
on the state S of the process at time t − 1, and σε2 .3
Due to the characterization of the SDM as a locally linear ARMA model we
impose a pair of ‘identifiability’ like conditions of the following form.
(i) The polynomials $\{1 - \sum_{i=1}^{p}\phi_i(\mathbf{x})z^i\}$ and $\{1 + \sum_{j=1}^{q}\theta_j(\mathbf{x})z^j\}$ have no common factors for all fixed vectors $\mathbf{x}$, and all their roots lie outside the unit circle.
The generality of (2.10) becomes more apparent as one imposes certain restric-
tions on μ(·), φi (·), and θj (·). One simple case is to take all these parameters as
constants, i.e. independent of St−1 . Then (2.10) becomes the well-known linear
ARMA(p, q) model. Some more elaborate characterizations of (2.10) are introduced in the following sections.
³ In fact, an equivalent vector state space representation of (2.10) is easily written down.
Figure 2.1: (a) A realization of $\{\varepsilon_t\}_{t=1}^{500}$ with $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$, and (b) a realization of the BL(1, 0, 1, 1) model (2.14), for parameter combination (φ = 0.5, ψ = 0.2), with the generated WN series in panel (a) as input.
$$Y_t = \phi_0 + \sum_{i=1}^{p}\phi_i Y_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j} + \sum_{j=1}^{q}\sum_{v=1}^{Q}\psi_{jv}\,Y_{t-j}\varepsilon_{t-v}. \tag{2.11}$$

This is a special case of a general bilinear (BL) model of order (p, q, P, Q), where P is constrained to be equal to q. The general BL model⁴ is defined as

$$Y_t = \phi_0 + \sum_{i=1}^{p}\phi_i Y_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j} + \sum_{u=1}^{P}\sum_{v=1}^{Q}\psi_{uv}\,Y_{t-u}\varepsilon_{t-v}. \tag{2.12}$$
This model is linear in the $Y_t$'s and also in the $\varepsilon_t$'s separately, but not in both. In other words, when all $\psi_{uv} = 0$, the ARMA(p, q) model is nested within (2.12). The following example illustrates this feature.
where ψ = ψ11 . This process is stationary and ergodic if φ2 + ψ 2 σε2 < 1; see
Chapter 3. Its mean is E(Yt ) = ψσε2 . Notice that (2.13) can be rewritten as
Equation (2.14) looks like a linear AR(1) process except that the AR parameter
φ + ψεt−1 is now time dependent, i.e. it may be viewed as a random variable
with mean φ. If ψ is positive, the AR parameter will increase with positive
values of εt−1 and decrease with negative values of εt−1 . However, positive
shocks will be more persistent than negative shocks in the sense that they
have a more sizeable effect on the conditional variability of {Yt , t ∈ Z}.
To illustrate this point, we simulate (2.14) with parameter combinations (φ =
0.5, ψ = 0.2) and (φ = 0.5, ψ = 0), with the second process nested within
the BL process. For both processes, we generate an identical set of i.i.d.
N (0, 1) random numbers. Figures 2.1(a) – (b) show T = 500 realizations of,
respectively, {εt } and the BL process {Yt , t ∈ Z}. Since ψ is positive, it can be
seen that the value of {εt−1 } has a direct effect on the value of {Yt } but that
this effect is larger for positive than for negative shocks, with values of $\{Y_t\}$ in the range [−3.45, 5.59]. In contrast, the AR(1) process takes values in the range [−3.70, 3.45].
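A minimal sketch of the simulation in this example: the BL(1, 0, 1, 1) recursion (2.14), $Y_t = (\phi + \psi\varepsilon_{t-1})Y_{t-1} + \varepsilon_t$, against its nested AR(1) counterpart (ψ = 0), driven by one common N(0, 1) noise sequence; the seed is arbitrary, so the exact ranges need not match those quoted above.

```python
# Simulate the BL(1,0,1,1) model and its nested AR(1) model.
import numpy as np

def simulate_bl(phi, psi, e):
    y = np.zeros(e.size)
    for t in range(1, e.size):
        y[t] = (phi + psi * e[t-1]) * y[t-1] + e[t]   # (2.14)
    return y

rng = np.random.default_rng(12)
e = rng.standard_normal(500)               # common WN input, as in Figure 2.1(a)
y_bl = simulate_bl(0.5, 0.2, e)            # bilinear: random AR coefficient
y_ar = simulate_bl(0.5, 0.0, e)            # nested linear AR(1)
print("BL range:", y_bl.min(), y_bl.max())
print("AR range:", y_ar.min(), y_ar.max())
# Positive shocks push the effective AR coefficient above phi, so the BL
# series shows larger positive excursions than the AR(1) series.
```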
$$Y_t = \varepsilon_t + \sum_{u=1}^{P}\sum_{v=1}^{Q}\psi_{uv}\,Y_{t-u}\varepsilon_{t-v}. \tag{2.15}$$

⁵ The terms super and sub are not quite natural, because it is purely by convention that lags in $\{Y_t, t \in \mathbb{Z}\}$ correspond to the first index (u) and lags in $\{\varepsilon_t\}$ correspond to the second index (v).
Figure 2.2: (a) – (d) Realizations of the processes (2.16) – (2.19), respectively; (e) Generalized impulse response functions (GIRFs) for both diagonal and subdiagonal models (blue medium dashed line), and superdiagonal model (red solid line) for a unit shock at t = 1; (f) GIRFs for both diagonal and superdiagonal models (blue medium dashed line) and subdiagonal model (red solid lines) for a permanent shock δ of magnitude −0.01, 0.02, and 1 at time t = 1.
Figures 2.2(a) – (d) show plots of the time series. The linear AR(1) model, as
a simple “baseline” specification, exhibits some evidence of long-term drift-like
behavior, consistent with the fact that this model is close to a random walk.
In marked contrast, model (2.17) exhibits two large, highly localized bursts;
similar to the extreme peaks in Figure 1.3. Also, note that the series seems
to have a sample mean of zero, which is consistent with the result E(Yt) = 0 established in Exercise 1.2. The series generated by the diagonal model also exhibits bursts, but here the general character of the series is quite different from the subdiagonal case. In particular, we see many isolated negative bursts, occurring frequently enough to produce a non-zero (specifically, negative) sample mean, which is in agreement with the fact that E(Yt) = −0.5.
Iterating each BL model, we get the following response functions for the three
models:
Figure 2.2(e) shows these responses for the case φ = 0.99 and ψ = −0.5. Note,
the series generated by the superdiagonal model appears to exhibit somewhat
similar behavior to the diagonal model. In contrast, the GIRF of the superdi-
agonal model defined by equation (2.19) is different from the other two mod-
els. In fact, the response functions of models (2.16) – (2.18) are identical (blue
medium dashed line). For the superdiagonal model the term −0.5Yt−1 εt−2
is non-zero for t = 2, and hence has a direct effect on the impulse response
function for t > 2 (red solid line).
Figure 2.2(f) presents a global picture of what happens when each of the three
BL models are hit by a permanent shock δ at time t = 1. The step responses for
δ = −0.01, 0.02, and 1 for the diagonal and superdiagonal models are identical
(blue medium dashed line). In fact, both step responses are described by an
equivalent AR(1) process with parameter φ + ψδ. The subdiagonal model
(2.17), on the other hand, exhibits much faster step responses (red solid lines).
There is a slight overshoot for this model, reflecting the fact that its equivalent
linear model is an AR(2) process, i.e. Yt = 0.99Yt−1 − 0.5δYt−2 + εt .
Figure 2.3: (a) A realization of the ExpAR(1) model (2.23) with ξ = −0.95 and corresponding histogram; (b) A realization of the ExpAR model (2.23) with ξ = 0.95 and corresponding histogram; T = 100.
In addition to (2.21), a necessary (but not sufficient) condition for the existence of a limit cycle of the ExpAR(p) process is that at least one of the roots of the associated characteristic equation lies outside the unit circle. Example 2.4 illustrates this feature of the ExpAR process via MC simulation.
Figure 2.3 shows T = 100 observations from (2.23) with ξ = −0.95 and ξ = 0.95, respectively, with corresponding histograms below each graph. Both time plots demonstrate the two types of amplitude-dependent frequency, i.e. increasing and decreasing frequency. For both values of ξ condition (2.21) is satisfied. However, only in the case ξ = −0.95 does a limit cycle exist. Indeed, it follows directly from the above definition that the skeleton of (2.23), i.e., its noise-free (εt ≡ 0) representation, has a limit cycle (τ1 , τ2 ) = (−1.50043, 1.50043). Still, the up- and down patterns in both time series plots are very similar. Both histograms show a bimodal distribution with light and short tails.
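A quick way to see such a limit cycle is to iterate the noise-free skeleton. The sketch below assumes the generic ExpAR(1) form Y_t = {φ + ξ exp(−γY²_{t−1})}Y_{t−1} + ε_t with hypothetical values φ = −0.5 and γ = 1 (only ξ = −0.95 is taken from the text), so the resulting cycle differs from (τ1 , τ2 ) of model (2.23).

    ## Iterate the noise-free ExpAR(1) map; phi and gamma are hypothetical.
    f <- function(y, phi = -0.5, xi = -0.95, gamma = 1)
      (phi + xi * exp(-gamma * y^2)) * y
    y <- 0.1                       # small perturbation away from the origin
    for (i in 1:300) y <- f(y)     # transients die out
    c(f(y), f(f(y)))               # alternates between -tau and tau: a period-two limit cycle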
Y_t = μ + Σ_{i=1}^{p} {φ_i + β_{i,t}} Y_{t−i} + ε_t ,    (2.24)
where {Bt = (β_{1,t} , . . . , β_{p,t})′} is a sequence of i.i.d. random vectors with zero mean E(Bt ) = 0 and Cov(Bt ) = Σ_β , and {Bt } is independent of {εt }.
Model (2.24) is termed a random coefficient AR (RCAR) model of order p. If p = 1, a necessary and sufficient condition for second-order stationarity is that φ² + σ_β² < 1; see Anděl (1976, 1984) for more complicated stationarity conditions
when p > 1. Note that the RCAR model can be generalized by introducing random coefficients into an ARMA model. Alternatively, by assuming the coefficients β_{i,t} are not independent but follow an arbitrary strictly stationary stochastic process (say an MA process) defined on the same probability space as {εt }, one obtains the so-called doubly stochastic model (Tjøstheim, 1986a,b).
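The RCAR(1) case and its stationarity condition are easy to explore by simulation, as in the sketch below; the parameter values are hypothetical.

    ## Simulate an RCAR(1) process Y_t = (phi + beta_t)Y_{t-1} + eps_t and
    ## check the second-order stationarity condition phi^2 + sigma_beta^2 < 1.
    set.seed(7)
    T <- 1000; phi <- 0.6; sig.beta <- 0.5
    stopifnot(phi^2 + sig.beta^2 < 1)   # 0.36 + 0.25 = 0.61 < 1
    beta <- rnorm(T, 0, sig.beta)       # i.i.d. random coefficients, zero mean
    eps  <- rnorm(T)                    # independent of {beta_t}
    y <- numeric(T)
    for (t in 2:T) y[t] <- (phi + beta[t]) * y[t - 1] + eps[t]
    var(y)                              # finite, in line with second-order stationarity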
Figure 2.4: (a) A realization of the NLMA model (2.26) with {εt } ∼ i.i.d. N (0, 1), β = 0.5, and T = 250; (b) Four permanent step response functions.
Y_t = ε_t + Σ_{i_1=0}^{Q} β_{i_1} ε_{t−i_1} + Σ_{i_1=0}^{Q} Σ_{i_2=0}^{Q} β_{i_1,i_2} ε_{t−i_1} ε_{t−i_2} + · · ·
      + Σ_{i_1=0}^{Q} · · · Σ_{i_q=0}^{Q} β_{i_1,i_2,...,i_q} ε_{t−i_1} ε_{t−i_2} · · · ε_{t−i_q} ,    (2.25)

where q denotes the highest order of the product terms in the summations. The model is termed a nonlinear moving average (NLMA) model of order (Q, q).
Note that a similar NLMA representation follows from restricting the Volterra expansion (2.5).
Y_t = φ_0^{(J_t)} + Σ_{u=1}^{p} φ_u^{(J_t)} Y_{t−u} + ε_t + Σ_{v=1}^{q} θ_v^{(J_t)} ε_{t−v} ,    (2.27)

where {εt } ∼ i.i.d. (0, σ_ε²), and the coefficients φ_u^{(J_t)} (u = 1, . . . , p) and θ_v^{(J_t)} (v = 1, . . . , q) are constants. For each t, the process {Jt } acts as the switching mechanism between
the k regimes. The process can be observable, hidden, or a combination of both.
Writing Y_t = (Y_t , . . . , Y_{t−p+1})′, a canonical (vector) form of (2.27) follows, where, for J_t = i,

C^{(i)} = (φ_0^{(i)}, 0, . . . , 0)′,    Φ^{(i)} = [ φ_1^{(i)} · · · φ_{p−1}^{(i)}  φ_p^{(i)} ; I_{p−1}  0_{(p−1)×1} ]  (a companion matrix),

Θ^{(i)} = [ θ_1^{(i)} · · · θ_q^{(i)} ; O_{(p−1)×q} ],    ε_t = (ε_t , . . . , ε_{t−q+1})′,

φ_u^{(i)} = 0 for u = p_i + 1, p_i + 2, . . . , p, with p = max(p_1 , . . . , p_k , d),
θ_v^{(i)} = 0 for v = q_i + 1, q_i + 2, . . . , q, with q = max(q_1 , . . . , q_k ).
Assume that the indicator variable J_t takes the value i if Y_{t−d} ∈ R^{(i)}.⁶ Then the general SETARMA model is defined as

Y_t = Σ_{i=1}^{k} { φ_0^{(i)} + Σ_{u=1}^{p_i} φ_u^{(i)} Y_{t−u} + ε_t^{(i)} + Σ_{v=1}^{q_i} θ_v^{(i)} ε_{t−v}^{(i)} } I(Y_{t−d} ∈ R^{(i)}),    (2.29)

where ε_t^{(i)} = σ_i ε_t , and {εt } ∼ i.i.d. (0, 1). Note that (2.29) may be viewed as a generalization of a nonhomogeneous linear ARMA model since the noise variances Var(ε_t^{(i)}) are different for different i.
Example 2.6: Dynamic Effects of a SETAR Model
To illustrate the effect of a one-unit shock or a permanent shock on {Yt , t ∈ Z}, it is instructive to consider the SETAR(2; 1, 0) model with threshold parameter r and delay d = 1, i.e.

Y_t = 2Y_{t−1} + ε_t   if |Y_{t−1}| ≤ r,
Y_t = ε_t              if |Y_{t−1}| > r,    (2.30)

where {εt } ∼ i.i.d. (0, σ_ε²). We see that the model switches between a locally explosive regime and a white noise regime.

Figure 2.5: (a) A realization of model (2.30) with r = 2, T = 250, and {εt } ∼ i.i.d. (0, 1); (b) Impulse response function for a one-unit shock at time t = 1; (c) Permanent step responses for δ = 0.1 and δ = 1; (d) Permanent step responses for δ = 2 and δ = 10.
Figure 2.5(b) shows the impulse response function of (2.30) for a one-unit shock at time t = 1 when r = 2 and Y_0 = 0. More generally, for an impulse of magnitude δ, initially Y_t = 0 for t ≤ 0, while Y_1 = δ. Next, for 0 < δ ≤ r, the resulting responses are {2δ, 2²δ, . . . , 2ⁿδ, 0, . . . , 0}, where n is the largest integer satisfying 2^{n−1}δ ≤ r. If δ > r, it follows that Y_1 = δ and Y_t = 0 for t ≥ 2. Consequently, the impulse response function has a duration of only one time period for δ > r.
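This doubling-then-collapse mechanism is easily verified numerically; the R sketch below computes the noise-free impulse responses of (2.30).

    ## Noise-free impulse responses of the SETAR(2; 1, 0) model (2.30).
    setar.irf <- function(delta, r = 2, H = 10) {
      y <- numeric(H); y[1] <- delta
      for (t in 2:H) y[t] <- if (abs(y[t - 1]) <= r) 2 * y[t - 1] else 0
      y
    }
    setar.irf(0.1)   # 0.1 0.2 0.4 0.8 1.6 3.2 0 0 0 0: doubles, then dies out
    setar.irf(10)    # 10 0 0 ...: a duration of one period, since delta > r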
where

φ_0 = φ_0^{(1)} + rφ_d^{(1)},  φ_d^− = φ_d^{(1)},  φ_d^+ = φ_d^{(2)},  and  φ_u = φ_u^{(1)} for u ≠ d.

E(Y_t ; θ|F_{t−1}) = φ_0 + Σ_{u=1, u≠d}^{p} φ_u Y_{t−u} + φ_d^− (Y_{t−d} − r)^− + φ_d^+ (Y_{t−d} − r)^+ ,    (2.32)

where F_t is the σ-algebra generated by {Y_s , s ≤ t}, and where (y)^− = min(0, y) and (y)^+ = max(0, y). Observe that the right-hand side of (2.32) can be written as Σ_{u=1}^{p} g_u (Y_{t−u}), where the g_u (·) (u ≠ d) are linear functions and g_d (·) is piecewise linear.
⁷ The class of CSETAR(MA) models should not be confused with the class of continuous-time threshold ARMA models, which may be viewed as a continuous-time analogue of (2.29); see, e.g., Brockwell (1994).
Figure 2.6: Scatter plot of a typical realization of the CSETAR model (2.33) with the true
AR functions overlaid (black solid lines); T = 500.
where {εt } ∼ i.i.d. N (0, 1). Figure 2.6 shows a scatter plot of Y_t versus Y_{t−1} for a typical simulated time series of length T = 500, with the true AR functions overlaid. Given (2.32), the CLS parameter estimates follow from minimizing the sum of squared residuals, following similar steps as in Algorithm 6.2; see also Chan and Tsay (1998). For the simulated series, we obtain the fitted model

Y_t = 1.02_{(0.11)} + 0.56_{(0.06)} (Y_{t−1} − 0.72_{(0.21)})   if Y_{t−1} ≤ 0.72_{(0.21)},
Y_t = 1.02_{(0.11)} − 0.48_{(0.12)} (Y_{t−1} − 0.72_{(0.21)})   if Y_{t−1} > 0.72_{(0.21)},    (2.34)

where the asymptotic standard errors of the parameter estimates are in parentheses. The standard errors of the residuals are σ̂_1 = 1.08 and σ̂_2 = 3.98. The sample sizes for the two regimes are 303 and 196, respectively. Comparing
(2.33) and (2.34), we see that the two models are similar. The closeness in absolute value of the two lag-one coefficients in (2.34) lends support to the use of a CSETAR model; see Gonzalo and Wolf (2005) for a formal test statistic.
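A minimal sketch of such a CLS fit is as follows: for each candidate threshold r on a grid, the conditional mean (2.32) with p = d = 1 is linear in (Y_{t−1} − r)^− and (Y_{t−1} − r)^+, so ordinary least squares applies. This is an illustration, not the exact Algorithm 6.2; y denotes an observed series.

    ## Grid-search CLS for a CSETAR(2; 1, 1) model: OLS at each candidate r,
    ## keeping the threshold that minimizes the residual sum of squares.
    csetar.cls <- function(y, grid = quantile(y, seq(0.1, 0.9, by = 0.01))) {
      best <- NULL
      for (r in grid) {
        x   <- y[-length(y)] - r                      # Y_{t-1} - r
        fit <- lm(y[-1] ~ pmin(x, 0) + pmax(x, 0))    # phi0, phi_d^-, phi_d^+
        if (is.null(best) || deviance(fit) < deviance(best$fit))
          best <- list(r = r, fit = fit)
      }
      best            # threshold and coefficient estimates minimizing the SSR
    }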
the threshold variables is linear, but unknown. For ease of explanation we formulate
the resulting model in terms of a SETAR specification. First, we introduce a general
framework.
Consider an m-dimensional Euclidean space R^m and a point x in that space. Let ω = (ω_1 , . . . , ω_m )′ denote an m-dimensional unknown parameter vector. These parameters define a hyperplane H = {x ∈ R^m | ω′x = r}, where r is a scalar. The direction of ω determines the orientation of the hyperplane, whereas r represents the position of the hyperplane in terms of its distance from the origin. The hyperplane H induces a partition of the space into two regions defined by the half-spaces H^− = {x ∈ R^m | ω′x ≤ r} and H^+ = {x ∈ R^m | ω′x > r}. In terms of the indicator function I(·), this partition is given by I(x) = 1 if x ∈ H^− and 0 otherwise.
Now, assume that an m-dimensional space is spanned by the vector of time series values X̃_{t−1} = (Y_{t−1} , . . . , Y_{t−m})′. Further, suppose that there are k functions I(ω_i′ X̃_{t−1} ≤ r_i ) (i = 1, . . . , k), where ω_i = (ω_1^{(i)} , . . . , ω_m^{(i)})′ and r_i are real parameters. Thus, each of these functions defines a threshold. Then a SETAR model with m (1 ≤ m ≤ p) thresholds and order (k; p, . . . , p), denoted by SETAR(k; p, . . . , p)_m , is defined as

Y_t = φ_0 + Σ_{u=1}^{p} φ_u Y_{t−u} + Σ_{i=1}^{k} ( ξ_0^{(i)} + Σ_{u=1}^{p} ξ_u^{(i)} Y_{t−u} ) I(ω_i′ X̃_{t−1} ≤ r_i ) + ε_t
    = φ′X_{t−1} + Σ_{i=1}^{k} ξ_i′ X_{t−1} I(ω_i′ X̃_{t−1} ≤ r_i ) + ε_t ,    (2.35)

where φ and ξ_i collect the corresponding AR coefficients. Note that (2.35) is not identified. For identification purposes, we impose the restriction r_1 ≤ · · · ≤ r_k . Further, due to the fact that I(x) = 1 − I(−x), a convenient normalization is to set one element of ω_i equal to unity.
where ω_1 = (1, −1)′, ω_2 = (0, 1)′, and X̃_{t−1} = (Y_{t−1} , Y_{t−2})′. Thus the dynamics of (2.36) are controlled by two threshold functions. The first one is a bi-dimensional threshold at Y_{t−1} − Y_{t−2} = 0. The second one is a single threshold at Y_{t−2} = 0. Figure 2.7(a) shows the threshold boundaries.⁸

⁸ Tiao and Tsay (1994) generalize the single-threshold SETAR to a similar model as in (2.36) with known parameters ω_i (i = 1, 2).
Figure 2.7: (a) Threshold boundaries of model (2.36); (b) Scatter plot of Y_{t−2} versus Y_{t−1} with two separating hyperplanes (red solid lines); T = 500, {εt } ∼ i.i.d. N (0, 1).
Y_t = φ_0 + Σ_{i=1}^{p} φ_i^+ Y_{t−i}^+ + Σ_{i=1}^{p} φ_i^− Y_{t−i}^− + ε_t + Σ_{j=1}^{q} θ_j^+ ε_{t−j}^+ + Σ_{j=1}^{q} θ_j^− ε_{t−j}^− .    (2.37)
Figure 2.8: Impact of a maintained unit shock from zero to one onwards from t = 10 (MA(+), asMA(+), blue solid lines) and a corresponding negative unit shock (MA(−), asMA(−), red solid lines) on the series {Yt }. From Brännäs and De Gooijer (1994).
where α_i = φ_i^+ − φ_i^− (i = 1, . . . , p) and β_j = θ_j^+ − θ_j^− (j = 1, . . . , q). We see that the asAR and asMA parts add two weighted sums of positive innovations to a conventional ARMA model.⁹ In addition, we see that (2.38) belongs to the class of threshold models, with I(ε_{t−i} > 0) (i = 1, . . . , max(p, q)) controlling the transition between the two regimes.
Brännäs and De Gooijer (1994) fitted the above model successfully to quarterly
growth rates in U.S. real GNP, using first differences of logged values of the
original series. Evidence of asymmetry may be noted from the sign and mag-
nitude of the parameter values. For instance, at lag 22 the response to a
⁹ If there is a threshold value r ≠ 0 in the ε_t^± functions, it can be accounted for by including a constant term in (2.38) and retaining r = 0 as a threshold value.
where εt denotes the tth residual. For the MA(3) model a positive or negative
shock has, apart from a change in sign, a similar effect on {Yt }. On the other
hand, for model (2.39), asymmetry is clearly present in the resulting series.
There is a more rapid decline to a lower level for a negative shock than there
is an increase to a higher level for a positive shock.
Note that the graph only gives the two most extreme outcomes out of 5² = 25 possible parameter combinations. Each combination corresponds to a particular sequence of positive and negative innovations. There is equal probability
for each combination when the innovations are i.i.d. from a symmetric distri-
bution. Each combination of an asMA model can be given a corresponding
AR representation. With 25 combinations, equally many AR representations
will arise. These can be seen as a reasonable approximation to, for instance,
a STAR model, discussed in Section 2.7.
Y_t = Σ_{i=1}^{k_1} Σ_{j=1}^{k_{i,2}} { φ_0^{(i,j)} + Σ_s φ_s^{(i,j)} Y_{t−s} + Σ_u ξ_u^{(i,j)} X_{t−u} + Σ_v η_v^{(i,j)} Z_{t−v} + ε_t^{(i,j)} + Σ_w θ_w^{(i,j)} ε_{t−w}^{(i,j)} } I(X_{t−d_2} ∈ R^{(i,j)}) I(Y_{t−d_1} ∈ R^{(i)}),    (2.40)

where {ε_t^{(i,j)}} ∼ i.i.d. (0, 1). Clearly, (2.40) consists of Σ_{i=1}^{k_1} k_{i,2} regimes.
Several (non)linear models emerge as special cases of (2.40).
Figure 2.9: Effects of various values of the smoothness parameter γ on (a) the logistic
transition function (2.43), and (b) the exponential transition function (2.44). Both functions
with c = 0 and d = 1.
• The ESTAR transition function is symmetric about c in the sense that the local dynamics are the same for high as for low values of Y_{t−1} , whereas the mid-range behavior, for values close to c, is different. Thus, the distance between Y_{t−1} and c matters, but not the sign. For the LSTAR model, the local dynamics depend on the distance between Y_{t−1} and c, as well as on the sign; the R sketch below illustrates both transition functions.
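The sketch assumes the common STAR forms for (2.43) – (2.44), namely the logistic function G(y) = [1 + exp(−γ(y − c))]⁻¹ and the exponential function G(y) = 1 − exp(−γ(y − c)²), here with c = 0.

    ## Logistic and exponential STAR transition functions, assumed standard forms.
    G.logistic    <- function(y, gamma, c = 0) 1 / (1 + exp(-gamma * (y - c)))
    G.exponential <- function(y, gamma, c = 0) 1 - exp(-gamma * (y - c)^2)
    y <- seq(-3, 3, by = 0.01)
    ## larger gamma sharpens the transition; as gamma -> Inf the logistic
    ## function approaches the indicator I(y > c), i.e. a SETAR-type switch
    matplot(y, sapply(c(0.5, 2, 10), function(g) G.logistic(y, g)),
            type = "l", ylab = "G(y)")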
Note that an asMA model of Section 2.6.5 contains 2^q separate MA(q) regimes. In some cases, it may also seem plausible to think of a continuum of MA regimes in which the transition from one extreme regime to the other is smooth. This requires modifying the transition function I(ε_{t−j} ≥ 0) into a smooth function G_j (γε_{t−j}) (γ > 0; j = 1, . . . , q). Since the transition function multiplying ε_{t−j} has ε_{t−j} as its argument ∀j, the resulting nonlinear model is additive in structure. For instance, setting p = 0, an additive smooth transition moving average (ASTMA) model of order q is given by
Y_t = ε_t + Σ_{j=1}^{q} {θ_j + δ_j G_j (γε_{t−j})} ε_{t−j} .    (2.45)
Here {α_i }_{i=0}^{p} is a non-negative sequence whose elements sum to one. Let β^{(0)} (≡ 0), β^{(1)} , . . . , β^{(p)} be p + 1 constants satisfying 0 ≤ β^{(j)} ≤ 1 (1 ≤ j ≤ p). Under the above restrictions the SDM reduces to
where

ε_t = E_t              with probability p_1 = (1 − β)/(1 − (1 − α)β),
ε_t = (1 − α)βE_t      with probability 1 − p_1 = αβ/(1 − (1 − α)β),    (2.48)

J_t = 0  with probability 1 − α,
J_t = 1  with probability α,    (2.49)
Figure 2.10: (a) A realization of the PAR(2) model Y_t = (0.3Y_{t−1}^{−0.9} + 0.5Y_{t−2}^{0.4})ε_t , with {εt } ∼ i.i.d. N (1, 0.1), and T = 500; (b) Sample ACF of the time series in (a) with 95% asymptotic confidence limits (blue medium dashed lines).
Note that the ACF depends only on the moments of the stationary marginal distribution. In the particular case of the gamma distribution such moments exist, and this distribution is the only one for which the PAR(1) model has the same ACF structure as an AR(1) process (McKenzie, 1982), hence its name.
More generally, the PAR(p) (p ≥ 2) model with non-additive noise is defined as

Y_t = V_t Σ_{i=1}^{p} φ_i Y_{t−i}^{α_i} .    (2.52)
Figure 2.10(a) shows a realization of a PAR(2) process, and Figure 2.10(b) its corresponding sample ACF. We see that the pattern of the sample ACF is compatible with the sample ACF of an AR(2) model.
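A short simulation along these lines is given below; it assumes the reconstructed PAR(2) form of Figure 2.10 with V_t = ε_t ∼ N (1, 0.1), and, as a simulation convenience, truncates the rare negative draws of V_t to keep the process positive.

    ## Simulate the (reconstructed) PAR(2) of Figure 2.10 and inspect its ACF.
    set.seed(3)
    T <- 500
    v <- rnorm(T, mean = 1, sd = sqrt(0.1))   # V_t ~ N(1, 0.1)
    v[v <= 0] <- 0.01                         # crude guard: keep the process positive
    y <- rep(1, T)
    for (t in 3:T)
      y[t] <- v[t] * (0.3 * y[t - 1]^(-0.9) + 0.5 * y[t - 2]^(0.4))
    acf(y)    # compare with the sample ACF of a linear AR(2) (cf. Figure 2.10(b))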
where xi is the value of the ith input node, α0j is a constant (the “bias”), the
summation i→j means summing over all input nodes feeding to j, and ωij are the
connecting weights. The nonlinearity enters the model through the activation-level
function Gj (·), usually a “smooth” transition function such as the logistic function
in (2.43).
For the output layer, the node is defined as

o = ψ( α_{0o} + Σ_{j→o} ω_{jo} h_j ),    (2.54)
Let m be the number of input units, and k the number of nodes in the hidden
layer. Then, the network weight vector, say θ, consists of a (k+1)×1 vector of biases
(α0o , α0j ) , an mk ×1 vector of input layer to hidden layer weights (ω 1 , . . . , ω k ) with
ω j = (ω1j , . . . , ωmj ) (j = 1, . . . , k), and a k ×1 vector of hidden layer to output layer
weights (ω1o , . . . , ωko ) . Thus, for an m–k–1 network the total number of weights,
or dimension of θ, is equal to r = (m + 1)k + (k + 1). Usually the weight vector θ
Figure 2.11: The architecture of a single hidden layer ANN with two input units (x_1 , x_2 ), three hidden units, and one output unit (y), a so-called m − k − 1 = 2 − 3 − 1 feed-forward network with 13 weights.
Thus, when ψ(·) is a linear activation-level function, there are direct linear connec-
tions from the input to the output nodes.
The weights θ are the adjustable parameters of the network, and they are obtained through a process called training. Let {(x_i , y_i )}_{i=1}^{N} denote the training set, where x_i denotes a vector of inputs, and y_i is the variable of interest. The objective of training is to determine a mapping from the training set to a set of possible weights so that the network will produce predictions ŷ_i which, in some sense, are “close” to the y_i ’s. For a given network, let o(x_i ; θ) be the output for a given x_i . Then by far the most common measure of closeness is the ordinary least squares function, i.e.

L_N (θ) = Σ_{i=1}^{N} {y_i − o(x_i ; θ)}² .
Assume that the network weight space Θ is a compact subset of the r-dimensional
Euclidean space Rr , which ensures that the true ANN model is locally unique with
58 2 CLASSIC NONLINEAR MODELS
regard to the objective function used for training. Then the weights are found as

θ̂ = arg min_{θ∈Θ} L_N (θ),
using some kind of iterative minimization scheme. A popular method is the back-
propagation algorithm, i.e. a gradient descent algorithm where the computations are
ordered in a simple fashion by taking advantage of the special structure of an ANN.
Y_t = h(X_{t−1} ; θ) + ε_t
    = φ_0 + φ′X_{t−1} + Σ_{j=1}^{k} ξ_j G(ω_j′ X_{t−1} − c_j ) + ε_t ,    (2.57)

where h(·) denotes the skeleton of a network with one hidden layer containing k nodes and no activation-level function at the output unit, with hidden activation-level function G(·): R → R, a Borel-measurable function of the input vector X_{t−1} = (Y_{t−1} , . . . , Y_{t−p})′, and with the network weight vector θ ∈ R^{(p+2)k+p+1} defined as

θ = (φ′, ξ′, ω′, c′, φ_0 )′.
In ANN terminology, the elements of the p × 1 vector φ are called the shortcut connections, the k × 1 vector ξ consists of the hidden-unit-to-output connections, the elements of the k × 1 vector c are called the hidden unit “bias” weights, and the elements of the pk × 1 vector ω are the so-called input-unit-to-hidden-unit connections. Thus, jointly with the intercept φ_0 , the dimension r of the network weight vector θ is equal to (p + 2)k + p + 1. Note that (2.57) does not include lags of {εt } in the set of input variables, and therefore is a feed-forward ANN.
Now, assume that the activation-level function is bounded, i.e. |G(x)| < δ < ∞ ∀x ∈ R. Let φ(z) be the characteristic polynomial associated with the shortcut connections. Then it can be shown (Trapletti et al., 2000) that the condition φ(z) ≠ 0 ∀z, |z| ≤ 1 is sufficient, but not necessary, for the ergodicity of the Markov chain {Yt }. Furthermore, if this condition holds, then {Yt , t ∈ Z} is geometrically ergodic (see
¹³ Analogous to the notation introduced for SETAR models, we refer to the number of regimes k first, and to the order p, . . . , p of the AR–NN model second. In contrast, some books use the notation AR–NN(p, k).
Figure 2.12: Skeleton h(X_{t−1} ; θ) of the AR–NN(2; 0, 1) model (2.58) for 25 iterations of {Yt } for each value of ξ = 1, 1.1, . . . , 24.9, 25.¹⁵
Section 3.4.2) and the associated AR–NN process is called asymptotically stationary. Typical choices for G(·) are the hyperbolic tangent (tanh) function and the logistic function.
Certain special cases of the AR–NN model are of interest. If the sum in (2.57)
vanishes, then the model reduces to a linear AR(p) model. For k > 0, this can be
achieved by either setting ξj = 0 or ω j = 0 ∀j. For the latter case, the sum is a
constant, independent of Xt−1 , and can be absorbed in the intercept φ0 .
¹⁵ This type of graph is commonly referred to as a bifurcation diagram in the chaos literature. The skeleton is the underlying dynamical system, i.e. the process without noise.
h(X_{t−1} ; θ) = 1 − 0.5Y_{t−1} + Σ_{j=1}^{3} G(Y_{t−1} ; ω_{1j}),    (2.59)
Figure 2.13 shows (2.59) as a function of the input series {Y_{t−1}}, with Y_{t−1} taking values in the set {−3, −2.9, . . . , 2.9, 3} (blue solid line). The values of the activation-level functions G(Y_{t−1} ; ω_{1j}) (j = 1, 2, 3) are displayed as blue dashed-dotted, dashed-dotted-dotted, and dotted lines, respectively.
For Y_{t−1} < −1 all three logistic activation-level functions are approximately equal to zero, so the behavior of (2.59) is determined largely by the slope of the linear activation-level function. For approximately −1 ≤ Y_{t−1} ≤ 0.7 the function G(Y_{t−1} ; ω_{12}) slowly starts increasing, but the values of the functions G(Y_{t−1} ; ω_{11}) and G(Y_{t−1} ; ω_{13}) remain approximately equal to zero. As a result, the downward trend of h(X_{t−1} ; θ) levels off. At about Y_{t−1} = 0.8, the function G(Y_{t−1} ; ω_{13}) changes from 0 to 1 fairly rapidly, and the value of the skeleton increases. Next, for approximately 1.2 < Y_{t−1} ≤ 1.7, the skeleton resumes its gradual decline, owing to the fact that G(Y_{t−1} ; ω_{12}) and G(Y_{t−1} ; ω_{13}) essentially achieve their maximum values while the function G(Y_{t−1} ; ω_{11}) is still not very active. Then, at about Y_{t−1} = 1.8, the function G(Y_{t−1} ; ω_{11}) begins to activate, resulting in a slow increase of h(X_{t−1} ; θ) up to about the point Y_{t−1} = 2.3. Finally, for Y_{t−1} ≥ 2.4 all three logistic functions are approximately equal to unity. So, once again, the linear activation-level function causes the gradual decline of the function h(X_{t−1} ; θ).
Figure 2.13: Skeleton h(Xt−1 ; θ) of an AR–NN(3; 1, 1, 1) model (2.59) (blue solid line).
The values of the logistic functions G(Yt−1 ; ω1j ) (j = 1, 2, 3) are shown as blue dashed-dotted,
dashed-dotted-dotted, and dotted lines, respectively.
Due to the symmetries in the ANN architecture, the value of the likelihood function remains unchanged if the hidden units are permuted, resulting in k! observationally equivalent arrangements of the coefficients of the model. This problem is resolved by imposing the restrictions c_1 ≤ · · · ≤ c_k or ξ_1 ≥ · · · ≥ ξ_k . The second characteristic is caused by the fact that G(x) = 1 − G(−x), where G(·) is the logistic function. This problem can be circumvented, for instance, by imposing the restriction ω_{1j} > 0 (j = 1, . . . , k). Finally, the presence of irrelevant hidden units in the nonlinear part of the AR–NN model can be eliminated by assuming that each hidden unit makes a unique non-trivial contribution to the overall AR–NN process, i.e. ξ_j ≠ 0, ω_j ≠ 0 ∀j (j = 1, . . . , k), and (ω_i′ , c_i )′ ≠ ±(ω_j′ , c_j )′ ∀i ≠ j (i, j = 1, . . . , k). In practice, these latter assumptions are a part of the model specification stage, applying statistical inference techniques.
where

h(X_{t−1} , e_{t−1} ; θ) = φ_0 + φ′X_{t−1} + ψ′e_{t−1} + Σ_{j=1}^{k} ξ_j G(ω_j′ X_{t−1} + ϑ_j′ e_{t−1} − c_j ),
with the activation-level function G(·) as introduced in Section 2.9.1, an observed in-
put vector Xt−1 = (Yt−1 , . . . , Yt−p ) , and a q × 1 input vector et−1 = (et−1 , . . . , et−q )
with a feedback through a linear MA-polynomial ϑj (j = 1, . . . , k) for filtering past
residuals. In ANN terminology this feature means that the ARMA–NN network is
recurrent : future network inputs depend on present and past network outputs.
Figure 2.14: A typical recurrent ARMA–NN(3; 2, 1) model with two lagged variables Y_{t−1} and Y_{t−2} and one recurrent variable e_{t−1} in the set of inputs; o_t denotes the network output at time t (so e_t = Y_t − o_t ), and B is the backward shift operator.
where B(X̃_{t−1} ; θ_j^B ) is defined as the difference between two opposed logistic functions, i.e.

B(X̃_{t−1} ; θ_j^B ) = 1/(1 + exp(−γ̃_j [ω̃_j′ X̃_{t−1} − c̃_{1j}])) − 1/(1 + exp(−γ̃_j [ω̃_j′ X̃_{t−1} − c̃_{2j}])),    (2.62)

and where θ_j^L = (ω_j′ , γ_j , c_{1j} , c_{2j})′ with ω_j = (ω_{1j} , . . . , ω_{pj})′, γ_j the slope parameter, and (c_{1j} , c_{2j}) (j = 1, . . . , k) the location parameters. Similarly, θ_j^B = (ω̃_j′ , γ̃_j , c̃_{1j} , c̃_{2j})′ with ω̃_j = (ω̃_{1j} , . . . , ω̃_{qj})′.
Let q = p. Then a special case of (2.61) is the local linear global neural network of order p, or L²GNN(k; p) model, where the approximation functions are assumed to be linear, that is, L(X_{t−1} ; θ_j^L ) = ξ_{0j} + ξ_j′ X_{t−1} with ξ_j = (ξ_{1j} , . . . , ξ_{pj})′. The L²GNN(k; p) model resembles the structure of the AR–NN(k; p) model (2.57), and is defined as

Y_t = Σ_{j=1}^{k} (ξ_{0j} + ξ_j′ X_{t−1}) B(X_{t−1} ; θ_j^B ) + ε_t ,    (2.63)
where, similar to the AR–NN of Section 2.9.1, restrictions on the parameters need
to be imposed to ensure identifiability. Further, it is easy to verify that (2.61) is
related to the SETAR(k; p, . . . , p)m model of Section 2.6.4, with a similar geometric
interpretation.
Figure 2.15: (a) Skeleton (the combined approximation and activation-level function) of the L²GNN(2; 1, 1) model (2.64) (blue solid line) with activation-level functions B(Y_{t−1} ; θ_1^B ) (blue medium dashed line) and B(Y_{t−1} ; θ_2^B ) (blue dotted line); (b) A typical realization of the L²GNN(2; 1, 1) model (2.64); T = 200.
and {εt } ∼ i.i.d. N (0, 1). Note that (2.64) is composed of a nonstationary AR(1) process, given by the linear approximation function L(Y_{t−1} ; θ_1^L ), and a stationary AR(1) process.
Figure 2.15(a) shows the skeleton of (2.64), i.e. the values of the combined approximation and activation-level function as a function of the input series {Y_{t−1}} (blue solid line). The values of B(Y_{t−1} ; θ_j^B ) (j = 1, 2) are displayed near the bottom of Figure 2.15(a). For approximately Y_{t−1} < −6.5 both activation-level functions are almost equal to zero. Around the point Y_{t−1} = −6.5, the function B(Y_{t−1} ; θ_1^B ) changes rapidly from 0 to 1, causing a steep increase in L(Y_{t−1} ; θ_1^L )B(Y_{t−1} ; θ_1^B ) when −6.5 < Y_{t−1} < −5.6. Then, when −5.6 < Y_{t−1} < −2.2, the values of the skeleton drop, due to L(Y_{t−1} ; θ_1^L ). At Y_{t−1} = −2.2, there is a slight increase in the values of the skeleton when the function B(Y_{t−1} ; θ_2^B ) begins to activate. Next, at Y_{t−1} = −1.7 a further decline sets in, with a small increase in the values of the skeleton when the function B(Y_{t−1} ; θ_1^B ) begins to deactivate. Finally, the skeleton goes to zero at about Y_{t−1} = 2.
In general, as {Yt } grows in absolute value, the functions B(Yt−1 ; θ Bi ) → 0 (i =
1, . . . , k), and thus {Yt } is driven back to 0. By imposing some weak conditions
on the parameters ω i , and using the above result, it can be proved (Suárez–
Fariñas et al., 2004) that the L2 GNN model is asymptotically stationary with
probability one, even if the model is a mixture of one or two explosive AR
processes.
Figure 2.15(b) shows a T = 200 realization of the L²GNN model (2.64). We observe that the series fluctuates around a fixed sample mean of −10.780, with a standard deviation of 9.978, suggesting that the process is asymptotically stationary. There are, however, occasional large negative values (max{Yt } = 10.109; min{Yt } = −38.428), indicating local nonstationarity.
Figure 2.16: Flow diagram of relationships between the NCTAR(k; p, . . . , p)_q model and its special cases, obtained via restrictions such as φ_0 = 0, φ = 0, ξ_{0j} = 0, ξ_j = 0 (or ω̃_j = 0), G(·) = I(·), and G(·) = B(·); the linear AR(p) model Y_t = φ_0 + φ′X_{t−1} + ε_t is the simplest special case.
Y_t = φ_0 + φ′X_{t−1} + Σ_{j=1}^{k} (ξ_{0j} + ξ_j′ X_{t−1}) G(X̃_{t−1} ; ω̃_j , c_j ) + ε_t ,    (2.65)

where

G(X̃_{t−1} ; ω̃_j , c_j ) = (1 + exp(−[ω̃_j′ X̃_{t−1} − c_j ]))^{−1},

with

X_{t−1} = (Y_{t−1} , . . . , Y_{t−p})′,  X̃_{t−1} = (Y_{t−1} , . . . , Y_{t−q})′,
ω̃_j = (ω̃_{1j} , . . . , ω̃_{qj})′,  ξ_j = (ξ_{1j} , . . . , ξ_{pj})′  (j = 1, . . . , k).
Imposing the same parameter restrictions as for the AR–NN model in Section 2.9.1 guarantees identifiability of the NCTAR model. Figure 2.16 shows a flow diagram of the various relationships between the (non)linear AR models.
Loosely speaking, a Markov process is called irreducible if any state j can be reached from any state i in a finite number of steps, and it is termed aperiodic if the number of steps needed to return to a state has no period. Furthermore, a Markov chain is ergodic if it is irreducible and aperiodic.
Any Markov chain has a stationary distribution {π_j = P(S_t = j)}_{j=1}^{k} satisfying

π_j = Σ_{i=1}^{k} π_i p_{ij} ,    (2.66)
where

δ_{ti} = 1  if S_t = i,
δ_{ti} = 0  otherwise,
with ε_t^{(i)} = σ_i ε_t , and {εt } ∼ i.i.d. (0, 1), independent of {St }. So, St denotes the regime or state prevailing at time t, one of k possible cases, i.e. it plays the role of {Jt }
in (2.27). In the case k = 1 there is only one state and {Yt , t ∈ Z} degenerates
to an ordinary ARMA process. Adding exogenous variables, such as trends, is a
straightforward extension of (2.67). Another extension of the model is to allow for
generalized autoregressive conditional heteroskedastic (GARCH) errors. Multivari-
ate modeling, including modeling cointegrated processes, is also an option.
Emphasis has been on two-state (k = 2) Markov switching AR (MSA or MSAR)
models with qi = 0 (i = 1, . . . , k) and w1 = p12 , w2 = p21 . The resulting process is
ergodic, with no absorbing states, if 0 < w1 < 1 and 0 < w2 < 1. The stationary
probabilities are π1 = w2 /(w1 + w2 ) and π2 = w1 /(w1 + w2 ) (cf. Exercise 2.7).
Moreover, the system stays in regime i for geometrically distributed time with mean
1/wi .
Figure 2.17(a) shows a realization of (2.68) with {εt } ∼ i.i.d. N (0, 1). A scatter plot of Y_t versus Y_{t−1} (not shown here) depicts two linear relationships between the two variables: one positive and one negative.
There are various ways to estimate the MS–AR model. Because {St } is not observed, the model does not directly give a likelihood function. Let θ = (φ_1^{(1)} , φ_1^{(2)} , σ_1² , σ_2² , p_{11} , p_{22})′ be the vector of parameters, and F_t the σ-algebra generated by {Y_s , s ≤ t}.

Figure 2.17: (a) A realization of the MS–AR(1) model (2.67), T = 500; (b) Estimated smoothed probabilities of states 1 and 2, plotted as blue and green solid lines, respectively.
where f (Y_t |F_{t−1} , S_t = j; θ) follows directly from the model, and P(S_t = j|F_{t−1} ; θ) can be obtained recursively from Bayes’ rule:

P(S_t = j|F_{t−1} ; θ) = Σ_{i=1}^{2} P(S_{t−1} = i|F_{t−1} ; θ) p_{ij} ,    (2.70)

P(S_t = i|F_t ; θ) = f (Y_t , S_t = i|F_{t−1} ; θ) / f (Y_t |F_{t−1} ; θ)
                   = f (Y_t |F_{t−1} , S_t = i; θ) P(S_t = i|F_{t−1} ; θ) / Σ_{i=1}^{2} f (Y_t |F_{t−1} , S_t = i; θ) P(S_t = i|F_{t−1} ; θ).    (2.71)
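These recursions are straightforward to code. The R sketch below implements the filtering steps (2.70) – (2.71) for a two-state MS–AR(1) model with zero intercepts; it is a minimal illustration, accumulating the log-likelihood for subsequent numerical maximization over θ.

    ## Filtering recursions for a two-state MS-AR(1): Y_t = phi[S_t]*Y_{t-1} + sigma[S_t]*eps_t.
    ms.filter <- function(y, phi, sigma, P) {     # phi, sigma: 2-vectors; P: 2 x 2
      T <- length(y); xi <- c(0.5, 0.5)           # diffuse start for P(S_1 = i)
      loglik <- 0; filt <- matrix(NA, T, 2)
      for (t in 2:T) {
        pred  <- as.vector(t(P) %*% xi)           # (2.70): P(S_t = j | F_{t-1})
        dens  <- dnorm(y[t], mean = phi * y[t - 1], sd = sigma)
        joint <- dens * pred                      # f(Y_t, S_t = j | F_{t-1})
        loglik <- loglik + log(sum(joint))
        xi <- joint / sum(joint)                  # (2.71): P(S_t = j | F_t)
        filt[t, ] <- xi
      }
      list(loglik = loglik, filtered = filt)      # maximize loglik over theta
    }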
For the simulated data of Figure 2.17(a), we obtain the parameter estimates
Implementation
Implementing an AR–NN model requires several decisions to be made. First, we need to decide whether the data need scaling. Rescaling the data is linked to the initial values of the weights ω_j (j = 1, . . . , k). These weights must vary over a reasonable range, neither too wide nor too narrow, compared with the range of the data. If this is not the case, the criterion function will have a number of local minima. Although it is difficult to offer general advice on the choice of scaling, the data in the training set are often standardized to have zero mean and variance one. Still, it is recommended to train an AR–NN a couple of times, using different initial weights. For the EEG recordings we decided to use the original data. Since the values of the inputs are large, but centered around zero, we followed a recommendation in the R documentation of the nnet package to take the initial values of the weights randomly from a uniform [−1/ max{|Y_t |}, 1/ max{|Y_t |}] (t = 1, . . . , N ) distribution, with N the size of the training data set, also called the total number of in-sample observations.
The next issue is the choice of G(·). A commonly used activation function is the
logistic function, which we adopt here. Furthermore, we need to choose the number
p of input (lagged) variables, and the number of hidden units k. Various strategies
have been proposed for this purpose. One strategy is to perform a grid search over a
pre-specified range of pairs (p, k) and select the AR–NN on the basis of minimizing
a model selection criterion. Recall, r = (p + 2)k + p + 1 denotes the number of
parameters fitted in the model. Then Akaike’s information criterion (AIC) and the
Bayesian information criterion (BIC) are, respectively, given by
AIC = N log(σ̂_ε²) + 2r,    BIC = N log(σ̂_ε²) + r log(N ).
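A sketch of this grid-search strategy using the R nnet package is given below; y denotes the in-sample series, and AIC/BIC follow the formulas above with r = (p + 2)k + p + 1 parameters. The fitting details (number of iterations, skip-layer connections) are illustrative choices.

    ## Fit an AR-NN(k; p, ..., p) with nnet and return AIC/BIC per the formulas above.
    library(nnet)
    fit.arnn <- function(y, p, k) {
      X   <- embed(y, p + 1)                    # columns: Y_t, Y_{t-1}, ..., Y_{t-p}
      fit <- nnet(X[, -1, drop = FALSE], X[, 1], size = k, linout = TRUE,
                  skip = TRUE, maxit = 500, trace = FALSE)
      N  <- nrow(X); r <- (p + 2) * k + p + 1   # number of weights
      s2 <- mean(residuals(fit)^2)              # residual variance estimate
      c(AIC = N * log(s2) + 2 * r, BIC = N * log(s2) + r * log(N))
    }
    ## grid search: evaluate fit.arnn(y, p, k) over, e.g., p = 7, 8 and k = 0, ..., 5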
Table 2.1: Comparison of various AR–NN models applied to the EEG recordings; T = 631. Blue-typed numbers indicate minimum values of a number of “key” statistics.

k   p   r    σ̂_ε²     AIC      BIC      RMSFE   MAFE
0   7   8    3875.15  7937.34  7971.73  65.76   51.29
1   7   17   3833.37  7949.44  8022.53  65.76   51.91
2   7   26   3852.46  7970.15  8081.92  65.27   51.20
3   7   35   3807.98  7981.83  8132.29  65.24   51.60
4   7   44   3744.84  7990.73  8179.89  63.76   49.87
5   7   53   3490.68  7970.50  8198.34  63.80   50.17
0   8   9    3146.67  7810.71  7849.38  51.99   40.43
1   8   19   3091.29  7821.07  7902.71  52.76   40.33
2   8   29   3041.18  7832.19  7956.81  52.23   39.75
3   8   39   3118.29  7865.79  8033.38  51.77   39.95
4   8   49   2702.61  7808.10  8018.66  51.25   39.12
5   8   59   2653.02  7818.05  8071.58  53.04   43.26
An alternative strategy is to select a linear AR(p) model first, using AIC or BIC. In the second stage, hidden units are added to the model, and the improvement in fit is again measured by AIC and BIC. In practice, we recommend the use of both order selection criteria. The reason is that the number of parameters in an AR–NN model is typically much larger than in traditional time series models, and the ordinary AIC does not penalize the addition of extra parameters enough, in contrast to BIC. Section 6.2.2 contains some alternative versions of AIC which, for large values of p, penalize extra parameters (much) more severely than AIC.
Subsamples
Since the time-interval between oscillations in the original time series of EEG record-
ings is about 80, we divide the data into two subsamples. The first subsample, used
for modeling, consists of a total of 551 observations. The remaining 80 observations
are used in the second sample for out-of-sample forecasting.
Table 2.1, columns 4 – 6, contains values of σ̂_ε², AIC, and BIC for a selection of AR–NN models fitted to the data in the first subsample. Blue-typed numbers
denote minimum values of these statistics. BIC selects an AR–NN(0; 8) model. This
result is in line with the linear AR(8) model preferred by AIC on the basis of the
complete data set of 631 observations. In particular, the resulting estimated model
is given by
where the asymptotic standard errors of the parameters are in parentheses, and where the residual variance is given by σ̂_ε² = 3080.48. In contrast, AIC picks the AR–
Table 2.2: EEG recordings. Biases and weights of the best fitted AR–NN(4; 8, . . . , 8) model.

                        Hidden layer                 Output layer
                  h1       h2       h3      h4           o
Bias    α0 →    -0.19     0.00     1.03   -0.01       -78.85
Input   i1 →   -16.57    19.59    -4.32    3.19         2.70
layer   i2 →    -1.74    10.80    -3.88    2.43        -3.25
        i3 →   -10.14     5.88     0.63    2.51         2.63
        i4 →    -6.17     3.40     2.97    1.69        -2.03
        i5 →     2.42    -2.65     4.96    0.61         0.96
        i6 →   -10.64    -4.51    -0.74    1.22         0.56
        i7 →   -10.87    -1.57    -7.31    1.62        -1.05
        i8 →     7.66    -4.56   -17.27    1.69         0.46
NN(4; 8, . . . , 8) model, which gives much better results in terms of residual variance than the model selected by BIC.
Table 2.2 shows the biases and weights of the single-layer AR–NN(4; 8, . . . , 8) model. Evidently, the weights correspond to the coefficients in the logistic activation-level functions G_j (·) (j = 1, . . . , 4). As can be seen from the values of ω_{jo} (j = 1, 2), the first two neurons h1 and h2 have much more effect on the output than the third and fourth neurons. The inputs at lags 1, 2, 3, 6, 7 and 8 have the largest effect, in absolute value, on the first hidden unit h1 , whereas all inputs contribute less to the second hidden unit h2 . Clearly, all inputs have an effect on h3 , but less on h4 . The signs tell us the nature of the correlation between the inputs to a neuron and the output from a neuron. The negative values of ω_{ij} at lags i = 2 (j = 1, 2, 4), i = 4 (j = 1, 2, 3), and i = 7 (j = 1, 2, 3) match the signs of the parameter estimates in the fitted linear AR(8) model. This is about all that can be said about the weights here. Indeed, it is unwise to try to interpret the weights any further, unless we reduce the influence of local minima by using different initial weights.
Forecasting
We consider the forecast performance of the AR–NN(k; p, . . . , p) models in a “rolling” forecasting framework, with parameter estimates based on a (551 − p) × p matrix consisting of the in-sample observations {Y_t}_{t=p}^{550}, {Y_t}_{t=p−1}^{550−1}, . . . , {Y_t}_{t=1}^{550−(p−1)} (here, p = 7 and p = 8); see Section 10.4.1 for details on various forecasting schemes. We evaluate the fitted model on the basis of H = 1 to H = H_max = 80-steps ahead forecasts. So, we use an 80 × p matrix consisting of the out-of-sample observations {Y_t}_{t=551}^{630}, {Y_t}_{t=551−1}^{630−1}, . . . , {Y_t}_{t=551−(p−1)}^{630−(p−1)}. Finally, the 80 forecast errors are summarized in two accuracy measures: the sample root mean squared forecast error (RMSFE) and the sample mean absolute forecast error (MAFE); see the last two columns of Table 2.1. Note that the difference between the AR–NN(5; 8, . . . , 8) and AR–NN(0; 8) models is minimal in terms of RMSFE and MAFE.
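The sketch below outlines such a rolling evaluation: a fitted one-step-ahead predictor pred.fun (e.g. a trained network wrapped in a function) is applied to the p observed lags at each out-of-sample origin, and the resulting 80 errors are summarized. Here, pred.fun and the index conventions are illustrative assumptions, not the exact scheme of Section 10.4.1.

    ## Rolling one-step-ahead evaluation over H out-of-sample origins.
    rolling.eval <- function(y, pred.fun, p, n.in = 550, H = 80) {
      e <- numeric(H)
      for (h in 1:H) {
        lags <- y[(n.in + h - 1):(n.in + h - p)]   # Y_{t-1}, ..., Y_{t-p}
        e[h] <- y[n.in + h] - pred.fun(lags)       # one-step-ahead forecast error
      }
      c(RMSFE = sqrt(mean(e^2)), MAFE = mean(abs(e)))
    }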
Section 2.2: D’Alessandro et al. (1974) provide a set of necessary and sufficient conditions for a Volterra series to admit a BL realization, and show there is a clear-cut method for determining the Volterra series for a BL system. Brockett (1977) links Volterra series and
geometric control theory by proving that over a finite time interval, a BL model, which is
itself a special case of Wiener’s model, can approximate any “nice” Volterra series with an
arbitrary degree of accuracy. Priestley (1988) discusses how BL models may be regarded
as the natural nonlinear extension of the ARMA model. A considerable amount of research
deals with various properties of BL models; see, e.g., the monographs by Granger and
Andersen (1978a), and Subba Rao and Gabr (1984).
Section 2.3: Haggan and Ozaki (1980, 1981) propose the ExpAR model when p = 2, d = 1,
and φ0 = 0. Earlier, Ozaki and Oda (1978) investigate the ExpAR(1) model with φ0 = 0
and d = 1. Jones (1978) considers methods for approximating the stationary distribution of
nonlinear AR(1) processes, including ExpAR(1) processes.
Section 2.4: The monograph by Nicholls and Quinn (1982) provides a good source of
the early works on RCAR models. These authors also generalize Andel’s (1976) results to
multivariate RCAR models. Amano (2009) proposes a G-estimator (named after Godambe)
for RCAR models. Aue et al. (2006) deal with QML estimation of an RCAR(1) model.
Pourahmadi (1986) presents sufficient conditions for stationarity and derives explicit results for doubly stochastic AR(1) processes with log(β_{1,t}²) in (2.24) following a stationary Gaussian process, an AR(1) process, and an MA(q) process.
Section 2.5: Robinson (1977) and Lentz and Mélard (1981) consider estimation of simple
nonlinear MA models using moment methods and ML, respectively. Ashley and Patterson
(2002) use GMM to obtain estimates of the coefficients of a quadratic MA model. Ventosa–
Santaulària and Mendoza–Velázquez (2005) propose a nonlinear MA conditional heteroske-
dastic (NLMACH) model with similar properties as the ARCH-class specifications.
Sections 2.6.1 – 2.6.2: Tong (1977, 1980, 1983, 1990) explores (self-exciting) TAR models
in a number of papers, and two subsequent books; see also Tong (2007). Other influential
publications are: Petruccelli (1992), who shows that threshold ARMA (TARMA) models,
with and without conditional heteroskedastic (ARCH) errors, can approximate SDMs al-
most surely; Tong and Lim (1980), who demonstrate the versatility of SETAR models in
capturing nonlinear phenomena; and K.S. Chan and Tong (1986), who discuss the problem
of estimating the threshold parameter. Nevertheless, as noted by Tong (2011, 2015), these
early publications did not attract many followers. Indeed, the real exponential growth of
the threshold approach, and its extensions took off only in the late 1990s. The impact of
Tong’s SETAR models is enormous across many scientific fields. For instance, Hansen (2011)
provides an extensive list of 75 papers published in the economics and econometrics literatures, which contribute to both the theory and application of the SETAR model. Similarly,
Chen et al. (2011b) review the vast and important developments of the threshold model in
financial applications.
Section 2.6.3: Gonzalo and Wolf (2005) propose a subsampling method for construct-
ing asymptotically valid confidence intervals for the threshold parameter in (dis)continuous
SETAR models. Stenseth et al. (2004) consider an extension of the CSETAR model, which
they call functional coefficient threshold AR model, that specifies some coefficients of the
SETAR model to be functions of some covariates.
Section 2.6.4: Medeiros et al. (2002b) propose SETAR models with unknown multivariate
thresholds. For most practical problems a search over all possible threshold combinations
(MS–)AR, and MS-state space models. S-Plus script files, using the S-Plus FinMetrics mod-
ule, are available at http://faculty.washington.edu/ezivot/MFTS2ndEditionScripts.
htm. R scripts are available at http://faculty.washington.edu/ezivot/MFTSR.htm. The
R-MSwM package deals with univariate MS–AR models for linear and generalized models
using the EM algorithm.
The website https://sites.google.com/site/marcelocmedeiros/Home/codes offers a
set of MATLAB codes to estimate logistic smooth transition regression models with and
without long memory; see McAleer and Medeiros (2008).
Section 2.9: MATLAB offers a toolbox for the analysis of ANNs. The toolbox NNSYSID
contains a number of m-files for training and evaluation of multi-layer perceptron type
neural networks; see http://www.iau.dtu.dk/research/control/nnsysid.html. There
are functions for working with ordinary feed-forward networks as well as for identification of nonlinear dynamic systems and time series analysis. Various ANN packages are available in
R. For instance, nnet, neuralnet, RSNNS, and darch.
Section 2.10: MS Regress is a MATLAB package for estimating Markov regime switching
models written by Marcelo Perlin and available at https://sites.google.com/site/
marceloperlin/. He also wrote a lighter version of the package in R which, however,
is no longer being maintained; search for FMarkovSwitching on R-forge. The MATLAB
code MS Regress tvtp is for estimating Markov-switching (MS) models with time varying
transition probabilities. Its implementation is based on the code written by Perlin.
Data and software (mainly GAUSS code) for estimating MS models is available from James
D. Hamilton’s website at http://econweb.ucsd.edu/~jhamilton/software.htm. The site
also offers links to software code written by third parties. The R-MSBVAR package includes
methods for estimating MS Bayesian VARs.
Appendix
Nonlinear time series models do not have a Wold representation, however. In these
models, the impact at time t + H of a shock that occurs at time t typically depends on the
history of the process up to the time the shock occurs, on the sign and the size of the shock,
and on the shocks that occur in intermediate periods t+1, . . . , t +H. This may, for instance,
be deduced from the discrete-time Volterra series expansion (2.3). To avoid these problems,
a natural thing to do is to use the expectation operator conditioned on only the history
and/or shock. Given this choice, the benchmark profile for the impulse response function is
then defined as the conditional expectation given only the history of the process ωt−1 . This
approach leads to the GIRF, originally developed by Potter (1995, 2000) in a univariate
framework and by Koop et al. (1996) in the multiple time series case. For a specific current
shock, εt = δ, and history ωt−1 , the GIRF is defined as
GIRFY (H, δ, ωt−1 ) = E[Yt+H |εt = δ, ωt−1 ] − E[Yt+H |ωt−1 ], (H ≥ 1). (A.2)
GIRFY (H, εt , F t−1 ) = E[Yt+H |εt , F t−1 ] − E[Yt+H |F t−1 ], (H ≥ 1). (A.3)
In general, the GIRF can be defined as a random variable conditional on particular subsets of shocks (e.g. only negative shocks) and histories (e.g. Y_{t−1} ≤ 0).¹⁶
Note, the above impulse response analysis concerns a single, transitory, shock δ at time
t. An alternative scenario is to measure the effect of a sequence of deterministic shocks
{δ1 , δ2 , . . . , δt , . . .} on {ε1 , ε2 , . . . , εt , . . .}. Recall that a strictly stationary nonlinear time
series process {Yt , t ∈ Z} may be plausibly described by a discrete-time Volterra expansion,
which can be expressed as
Yt = G(εt , εt−1 , . . . , ε1 , ε0 ),
i.i.d.
where {εt } ∼ N (0, 1), ε0 = (ε0 , ε−1 , . . .), and G(·) is a suitably smooth real-valued func-
tion. Again, the goal is to summarize the effect of the shocks on the time evolution of Yt by
a single measure. Since, however, future innovations are unknown, both the benchmark pro-
file and the profile after the arrival of a shock are random variables. Let {ε_1^s , ε_2^s , . . . , ε_t^s , . . .} denote a future path for the innovations, where ε_1^s , ε_2^s , . . . , ε_t^s , . . . are i.i.d. N (0, 1) conditional on ε_0 . The random benchmark profile, or benchmark path, is equal to
Observe that this approach ignores the dependence between the benchmark and perturbed paths, accounted for by the joint distribution of (Y_t^s (ε_0 ), Y_t^s (δ, ε_0 ), t ≥ 1). Moreover, since the distribution of {εt } is symmetric, positive and negative shocks will have the same likelihood of occurrence. We refer to Gouriéroux and Jasiak (2005) for an alternative impulse response analysis, using the concept of nonlinear innovations, which eliminates these problems and provides a straightforward interpretation of transitory or symmetric shocks.
It follows that, for all t ≥ 2, the effect of a shock as measured by the conditional expectation of the process {Y_t^D (δ), t ∈ Z} converges toward zero if |φ| < 1, which is a more stringent condition than the necessary and sufficient condition for stationarity of this model, i.e. E[log(φ + ψε_t )] < 0; see Chapter 3.
BAND–TAR A TAR model with the characteristic feature that the time series
process returns to an equilibrium band rather than an equilib-
rium point; Balke and Fomby (1997).
C–(M)STAR Contemporaneous (multivariate) STAR model. When the mix-
ing weights are determined by the probability that contem-
poraneous latent variables exceed certain threshold variables;
Dueker et al. (2011).
CSETAR Continuous SETAR; Section 2.6.3.
EDTAR Endogenous delay TAR model. The model differs from the standard TAR implementation by using previously unexploited information about the length of time spent in regimes. This allows the construction of “sub-regimes” within “major” regimes. Parsimony is maintained by tightly restricting parameters across the sub-regimes; Pesaran and Potter (1997), Koop and Potter (2003), and Koop et al. (1996).
EQ–TAR Equilibrium TAR. When the process tends towards an equilib-
rium value when it moves outside the threshold bounds; Balke
and Fomby (1997).
GTM Generalized threshold mixed model. A generalization of the
TARX model to take account of non-Gaussian errors; Samia et
al. (2007).
LTVEC Level TVEC model. When the equilibrium error process is
different in each regime; De Gooijer and Vidiella-i-Anguera
(2003b).
M–TAR Momentum TAR, with the thresholding based on the differences
of the time series; Enders and Granger (1998).
MSETAR Multivariate SETAR model. The model allows the threshold
space to be equal to the dimension of the multivariate process us-
ing lagged values of the vector input series; Arnold and Günther
(2001).
MUTARE Multiple SETAR model. The threshold variable is applied to
all the historical observations with a hierarchical substructure
imposed upon the submodels; Hung (2012).
NeTARMA Nested SETARMA model. The model defines primary level sep-
arated regimes using a threshold function which depends on one
source and within each regime of the first stage, two more re-
gimes are nested that are defined by a threshold function which
depends on another source; Section 2.6.6.
PLTAR Piecewise linear threshold AR model. When the coefficients of
the SETAR model are linear functions of the state vector Yt−d
for some delay d; Baragona et al. (2004a).
Q–SETAR Quantile SETAR model. When the existence of different re-
gimes depends on the quantile of the series to be modeled. By
estimating a sequence of conditional quantiles, the model de-
scribes the dynamics of the conditional distribution of a time
series, not just the conditional mean; Cai and Stander (2008).
Exercises
Theory Questions
2.1 Show that any BL(p, q, P, Q) model may be “converted” into a superdiagonal BL
model by replacing εt with ωt = εt+L for some L ∈ N. Take as examples models
(2.17) and (2.18).
2.2 Consider the ExpARMA(p, q) model in (2.20) with d = 1. Let {εt } ∼ i.i.d. (0, σ_ε²) with a density function which is strictly positive on R^{p+q}. Assuming that the DGP is completely known, express {Yt , t ∈ Z} as a convergent series via repeated substitution. Discuss briefly how this representation can be used to prove that the process is invertible if max_{1≤j≤q} (|θ_j | + |τ_j |) < 1.
2.3 A Markov process {Yt } is said to be ergodic if starting at any point Y1 = y, the distri-
bution of YT converges to a stationary distribution π(x) = limT →∞ P(YT < x|Y1 = y),
independent of y. It is called geometrically ergodic if this convergence occurs at an ex-
ponential rate. Geometric ergodicity is a concept of stability of the process; it excludes
explosive or trending behavior; see Chapter 3. For the SETAR(2; 1, 1) process

Y_t = φ_1 Y_{t−1} + ε_t   if Y_{t−1} ≤ 0,
Y_t = φ_2 Y_{t−1} + ε_t   if Y_{t−1} > 0,

necessary and sufficient conditions for geometric ergodicity are φ_1 < 1, φ_2 < 1 and φ_1 φ_2 < 1. These conditions imply the following three possible cases:
(i) |φ_1 | < 1 and |φ_2 | < 1;
(ii) φ_2 ≤ −1 and −1 ≤ 1/φ_2 < φ_1 < 1;
(iii) φ_1 ≤ −1 and −1 ≤ 1/φ_1 < φ_2 < 1.
Note that in each case, at least one of the two regimes is stationary (|φi | < 1).
(a) Suppose that, in cases (ii) or (iii), the system starts in a nonstationary regime
(i.e., φi < −1). Explain (intuitively) why the system will always move to the
other (stationary) regime in a few steps, i.e., the probability that it will stay in
the nonstationary regime for the next T periods goes to zero as T → ∞. Assume {εt } ∼ i.i.d. N (0, σ_ε²).
(b) Explain why the system will not be stable if φ1 = −1.25 and φ2 = −0.8 (even
though the second regime is stationary).
(c) Consider a SETAR(k; 1, . . . , 1) process. It has been proved that the conditions for geometric ergodicity are φ_1 < 1, φ_k < 1 and φ_1 φ_k < 1. Explain, using the appropriate versions of (i) – (iii), why the values of the AR parameters in the intermediate regimes (φ_2 , . . . , φ_{k−1}) are irrelevant for the stability of the process.
2.4 Consider the SETAR(2; 1, 1) model

Y_t = φY_{t−1} + ε_t    if Y_{t−1} ≤ 0,
Y_t = −φY_{t−1} + ε_t   if Y_{t−1} > 0,

where 0 < φ < 1, and {εt } ∼ i.i.d. N (0, 1). The stationary marginal pdf of {Yt , t ∈ Z} is given by

f (y) = 2 ((1 − φ²)/(2π))^{1/2} exp{−(1 − φ²)y²/2} Φ(−φy),

with Φ(·) the standard normal distribution function.
(a) Prove that f (y) is a solution of the equation

f (y) = (1/√(2π)) ∫_{−∞}^{0} exp{−(y − φx)²/2} f (x) dx + (1/√(2π)) ∫_{0}^{∞} exp{−(y + φx)²/2} f (x) dx.
(b) Prove that the mean and variance of {Yt , t ∈ Z} are respectively given by

E(Y_t ) = −(2/π)^{1/2} φ(1 − φ²)^{−1/2},   Var(Y_t ) = (1 − φ²)^{−1} (1 − 2φ²/π).
[Hint:

∫_{−∞}^{∞} u Φ(au + b) ϕ(u) du = (a/√(1 + a²)) ϕ(b/√(1 + a²)),

∫_{−∞}^{∞} u² Φ(au + b) ϕ(u) du = Φ(b/√(1 + a²)) − (a²b/(1 + a²)^{3/2}) ϕ(b/√(1 + a²)),

with the standard normal pdf ϕ(u) = (2π)^{−1/2} exp(−u²/2).]
where {εt } ∼ i.i.d. N (0, 1).
(a) Prove that the mean and variance are respectively given by

μ_Y = E(Y_t ) = (θ⁺ − θ⁻)/√(2π),   Var(Y_t ) = 1 + ((θ⁺)² + (θ⁻)²)/2 − μ_Y².
(b) Assuming stationarity, it is easy to see that the conditional pdf of {Yt , t ∈ Z}, given ε_{t−1} = u ≥ 0, is normally distributed with mean μ⁺ = E(Y_t |u) = θ⁺u and variance unity. Similarly, the conditional pdf of {Yt }, given ε_{t−1} = u < 0, is normally distributed with mean μ⁻ = −θ⁻u and variance unity. Given these results, prove that the marginal pdf of {Yt , t ∈ Z} is given by

f (y) = (1/({1 + (θ⁺)²}^{1/2} √(2π))) exp(−y²/[2{1 + (θ⁺)²}]) Φ(θ⁺y/{1 + (θ⁺)²}^{1/2})
      + (1/({1 + (θ⁻)²}^{1/2} √(2π))) exp(−y²/[2{1 + (θ⁻)²}]) Φ(−θ⁻y/{1 + (θ⁻)²}^{1/2}).
(c) Consider the case θ+ = −θ− ≡ θ. Using part (b), prove that the marginal pdf
of {Yt , t ∈ Z} is identical to the marginal pdf of the SETAR(2; 1, 1) model in
Exercise 2.3 with φ = θ/(1 + θ2 )1/2 .
2.6 (a) Verify the statement in Section 2.8.1 that the NEAR(1) process is not time-reversible, using the third-order cumulants of the process; see (4.2) for cumulants.
(b) Consider the PAR(1) process (2.50) with an exponential marginal distribution of unit mean. Similarly to part (a), show that the process {Yt , t ∈ Z} is not time-reversible.
2.7 Let St ∈ {1, 2} follow a two-state Markov chain with switching probabilities 0 < w1 <
1 and 0 < w2 < 1.
(a) Show that the stationary probabilities are π1 = w2 /(w1 +w2 ) and π2 = w1 /(w1 +
w2 ), so that μ = E(St ) = 1 + π2 = (2w1 + w2 )/(w1 + w2 ).
(b) Show that the process {St − 1} is an i.i.d. Bernoulli sequence if w1 + w2 = 1.
(c) Show that E(St |St−1 , St−2 , . . .) = μ(1 − φ) + φSt−1 , with φ = 1 − w1 − w2 , so
that {St } follows an AR(1) process.
2.8 Let {Pt } denote the price of an asset at time t (not paying dividends); then the continuously compounded return, or log-return (often simply called return), is defined as

r_t = log(1 + R_t ) = log(P_t /P_{t−1}) = p_t − p_{t−1} ,

where R_t = (P_t − P_{t−1})/P_{t−1} is the one-period simple return, and p_t = log P_t . The k-period return is the sum of the one-period log-returns: r_t [k] = p_t − p_{t−k} = Σ_{j=0}^{k−1} r_{t−j} (k = 1, 2, . . .). Now, assume that {r_t , t ∈ Z} follows the TGARCH(1, 1) model r_t = Y_t = σ_t ε_t , with σ_t² = α_0 + {α_1 + γ_1 I(Y_{t−1} < 0)}Y_{t−1}² + β_1 σ_{t−1}² and {εt } ∼ i.i.d. (0, 1), independent of σ_t , with E(ε_t³) = 0. The parameters satisfy α_0 > 0, α_1 ≥ 0, β_1 ≥ 0 and γ_1 > 0. Assume that the parameters also satisfy conditions such that σ_Y² = Var(Y_t ) and E(|Y_t |³) are finite.
(a) Show that the (one-period) returns r_t [1] = r_t = Y_t have skewness zero, i.e.

τ_Y = E(Y_t³)/σ_Y³ = 0.

(b) Obtain an expression for the skewness of the two-period returns r_t [2] = Y_t + Y_{t−1} , and show that it is negative if γ_1 > 0.
2.9 The file eeg.dat contains the EEG recordings used to estimate the AR–NN models in Section 2.11. Use the data to replicate the results reported in Tables 2.1 and 2.2.
[Note: The results need not be exactly as shown in both tables, since they depend heavily on the initial weights chosen at random by the R function nnet, unless the seed is fixed via set.seed(1).]
2.10 Consider the quarterly U.S. unemployment rate in Example 1.1, which we denote by {U_t }_{t=1}^{252}. If we were to work directly with this series, the assumption of a symmetric error process would be inappropriate. Various instantaneous data transformations have been employed in the analysis of {U_t }. These include the logistic transformation, first differences, the logarithmic transformation, and log-linear detrending. Because {U_t } takes values between 0 and 1, we adopt the logistic transformation, i.e., {Y_t = log[U_t /(1 − U_t )]}_{t=1}^{252}. The transformed series (see Figure 6.2(a)) is now unbounded, and it is reasonable to assume that the error process {εt , t ∈ Z} of the nonlinear DGPs considered below is conditionally Gaussian distributed. The data are in the file USunemplmnt logistic.dat.
2.11 Astatkie et al. (1997) develop a NeSETAR model for an Icelandic streamflow system
for the years 1972 – 1974, i.e. the Jökulsá Eystri in north-west Iceland. The dynamic
system consists of daily data on flow (Qt ), precipitation (Pt ), and temperature (Tt ).
After some experimentation, it was found that the best-fitting NeSETAR model for Q_t is

Q_t = 4.82_{(0.68)} + 0.82_{(0.03)} Q_{t−1}                                   if Q_{t−2} ≤ 92 m³/s and T̄_t ≤ −2 °C,
Q_t = 1.32_{(0.06)} Q_{t−1} − 0.32_{(0.06)} Q_{t−2}
        + 0.20_{(0.03)} P_{t−1} + 0.52_{(0.10)} T_t                           if Q_{t−2} ≤ 92 m³/s and −2 °C < T̄_t ≤ 1.8 °C,
Q_t = 1.15_{(0.04)} Q_{t−1} − 0.18_{(0.04)} Q_{t−2} + 0.01_{(0.00)} P²_{t−1}
        + 1.22_{(0.13)} T_t − 0.89_{(0.17)} T_{t−3}                           if Q_{t−2} ≤ 92 m³/s and T̄_t > 1.8 °C,
Q_t = 49_{(13.6)} + 0.45_{(0.12)} Q_{t−1}
        + 3.47_{(1.55)} T_t + 3.75_{(1.71)} T_{t−1} − 6.08_{(1.43)} T_{t−3}   if Q_{t−2} > 92 m³/s,
                                                                              (2.72)

where T̄_t = (T_{t−1} + T_{t−2} + T_{t−3})/3, and with asymptotic standard errors of the parameter estimates in parentheses. The model includes 16 parameters and produces
a pooled residual variance of 27.4 [m³/s]². As a comparison, Tong et al. (1985) and Tong (1990, Section 7.4.4) use a TARSO model with 42 parameters to describe the streamflow data, resulting in a residual variance of 31.8 [m³/s]².
The file jokulsa.dat contains the series stored in a 1,086 × 32 matrix with variables
(Qt , Qt−1 , . . . , Qt−10 , Pt , Pt−1 , . . . , Pt−10 , Tt , Tt−1 , . . . , Tt−9 ).
(a) Using the notation introduced in Section 2.6.6, specify the structure of the Ne-
SETAR model (2.72). Interpret the fitted relationship.
(b) Using the supersmoother (R function supsmu) proposed by Friedman (1984), regression estimates of Q_t on Q_{t−1} and Q_{t−2} reveal that there are two linear pieces in the data, with a threshold estimate r̂_1 = 92 m³/s.
Using the same method as above, verify the estimated second-stage threshold r̂_{2,1} = −2 °C.
(c) Form subset data sets for each regime, and estimate the final model by least
squares. Plot the sample ACF and sample PACF of the normalized residuals
and comment.
From the previous two chapters we have seen that the richness of nonlinear models is
fascinating: they can handle various nonlinear phenomena met in practice. However,
before selecting a particular nonlinear model we need tools to fully understand the
probabilistic and statistical characteristics of the underlying DGP. For instance,
precise information on the stationarity (ergodicity) conditions of a nonlinear DGP
is important to circumscribe a model’s parameter space or, at the very least, to
verify whether a given set of parameters lies within a permissible parameter space.
Conditions for invertibility are of equal interest. Indeed, we would like to check
whether present events of a time series are associated with the past in a sensible
manner using an NLMA specification. Moreover, verifying (geometric) ergodicity is
required for statistical inference.
In this chapter, we address the above topics. To find a balance between the
many works on stationarity and ergodicity of nonlinear DGPs and yet to achieve
results of general practical interest, we first discuss in Section 3.1 the existence of
strict stationarity of processes embedded within the class of stochastic recurrence
equations (SREs). Associated with the SRE, we define the notion of a Lyapunov
exponent which measures the “geometric drift” of a process. This notion plays a
central role throughout the rest of this chapter. In Section 3.2, we briefly mention
a criterion for checking second-order stationarity. Next, in Section 3.3, we focus on the stationarity (ergodicity) of the class of nonlinear AR–(G)ARCH models, as a special case and application of the class of SREs. In Section 3.4, we collect some
Markov chain terminologies and relevant results ensuring not only ergodicity, but
also geometric ergodicity of a DGP. In Section 3.5, we discuss ergodicity, global and
local invertibility of NLMA models with special emphasis on the SETMA model.
This section also contains an empirical method to assess the notion of invertibility
in practice.
Two appendices are added to the chapter. Appendix 3.A reviews some basic
properties of vector and matrix norms, while Appendix 3.B discusses the spectral
radius of a matrix.
$$\|A\|_s = \sup_{y \in \mathbb{R}^m,\, y \neq 0} \frac{\|Ay\|_s}{\|y\|_s}. \tag{3.2}$$
Figure 3.1: Strict stationarity parameter region (I ∪ II) based on estimates of the top Lyapunov exponent, and second-order stationarity parameter region (II) for model (3.6) with $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$.
Then (3.6) can be written in the form of the SRE (3.1) with
$$\mathbf{Y}_t = \begin{pmatrix} Y_t \\ Y_{t-1} \end{pmatrix}, \quad A_t = \begin{pmatrix} \beta_1\varepsilon_{t-1} & \beta_2\varepsilon_{t-2}^2 \\ 1 & 0 \end{pmatrix}, \quad B_t = \begin{pmatrix} \varepsilon_t \\ 0 \end{pmatrix}.$$
When $\beta_2 = 0$ (i.e., $m = 1$), the strict stationarity condition based on the top Lyapunov exponent takes the simple form $\gamma(A) = E(\log|\beta_1\varepsilon_t|) = \log|\beta_1| + E(\log|\varepsilon_t|) < 0$. If $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2)$, the condition reduces to $\sigma_\varepsilon|\beta_1| < \sqrt{2}\exp(C/2) = 1.8874\cdots$, where $C$ is Euler's constant.
When $m > 1$, closed-form expressions for $\gamma(A)$ are hard to obtain, and one has to resort to MC simulations. Figure 3.1 shows the resulting parameter regions for model (3.6). The second-order stationarity parameter region II is much smaller than the region for strict stationarity. In the case of strict stationarity the curve for $\gamma(A) = 0$ passes through the points $(\beta_1, \beta_2) = (0, \pm 3.7748)$ and $(\beta_1, \beta_2) = (\pm 1.8874, 0)$.
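The MC estimation of $\gamma(A)$ underlying Figure 3.1 can be sketched in a few lines of R. The renormalized matrix-product recursion below is a minimal sketch; the function name and simulation size are our own choices.

set.seed(12)
lyapunov <- function(beta1, beta2, n = 20000) {
  eps  <- rnorm(n + 2)                  # {eps_t} i.i.d. N(0,1)
  P    <- diag(2)                       # running product A_n ... A_1
  logn <- 0
  for (t in 3:(n + 2)) {
    A <- matrix(c(beta1 * eps[t - 1], beta2 * eps[t - 2]^2, 1, 0),
                2, 2, byrow = TRUE)
    P <- A %*% P
    s <- norm(P, type = "2")            # renormalize to avoid overflow
    P <- P / s
    logn <- logn + log(s)
  }
  logn / n                              # gamma(A) = lim (1/n) log ||A_n ... A_1||
}
lyapunov(1.8874, 0)                     # approximately 0: boundary point in Figure 3.1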
Clearly, every strictly stationary process which satisfies $E\|\mathbf{Y}_t\|^2 < \infty$ is also second-order stationary. In the sequel, we focus on the $m$-vector time series $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$ generated by (3.1).
Given the strictly stationary solution in (3.5), the vector process $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$ is a Cauchy sequence in $L^2$ if and only if $\big\|\big(\prod_{j=0}^{s-1} A_{t-j}\big)B_{t-s}\big\|_2$ exists and converges to 0 at an exponential rate as $s \to \infty$. Using the i.i.d. property of $\{(A_t, B_t), t \in \mathbb{Z}\}$ and Kronecker product notation, this quantity can be expressed in terms of moments of $A_t^{\otimes 2} = A_t \otimes A_t$.
Now, the spectral radius $\rho(M)$ of a square matrix $M$ (see Appendix 3.B) is defined as
$$\rho(M) = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } M\}.$$
Then, provided $E\|B_t\|^2 < \infty$, it can be deduced (see, e.g., Nicholls and Quinn, 1982; Tjøstheim, 1990) that
$$\rho\big[E(A_t^{\otimes 2})\big] < 1$$
is a necessary and sufficient condition for the moments of order two to exist. This condition has a similar implication as the requirement that the characteristic polynomial associated with a linear AR process has no roots on and within the unit circle. If, in addition, $A_t$ has finite moments of order $2m$ ($m > 1$), then a necessary and sufficient condition ensuring finiteness of higher-order moments is $\rho[E\{(A_t)^{\otimes 2m}\}] < 1$, where $M^{\otimes m} = M \otimes \cdots \otimes M$ ($m$ factors); see, e.g., Pham (1986, Lemma 2). Finally, if $\{A = A_t\}$ is a deterministic process, then from (3.9) it follows that $\gamma(A) = \log\rho(A)$.
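For model (3.6) the second-order stationarity condition can be checked numerically; the sketch below estimates $E(A_t \otimes A_t)$ by MC and computes its spectral radius (function name and simulation size are our own choices).

set.seed(34)
rho_kron <- function(beta1, beta2, n = 100000) {
  EK <- matrix(0, 4, 4)
  for (i in 1:n) {
    e <- rnorm(2)                                      # (eps_{t-1}, eps_{t-2})
    A <- matrix(c(beta1 * e[1], beta2 * e[2]^2, 1, 0), 2, 2, byrow = TRUE)
    EK <- EK + kronecker(A, A)
  }
  max(Mod(eigen(EK / n, only.values = TRUE)$values))   # spectral radius of E(A (x) A)
}
rho_kron(0.5, 0.2)                                     # < 1: second moments exist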
3.3 Application: Nonlinear AR–GARCH Model
where $0 < \|B(y/\|y\|, u)\| \le b(1 + |u|)$ and $\|C(y, u)\| \le c(y)(1 + |u|)$ for finite $b$ and $c(y) = o(\|y\|)$, and where $\{\varepsilon_t\}$ are i.i.d. random variables with a density symmetric about 0 and positive on the real line. We also presume that $E(|\varepsilon_t|^r) < \infty$ for some $r > 0$. Note that (3.8) includes the SRE in (3.1). Cline (2007c) provides explicit expressions for $B(y/\|y\|, u)\|y\|$ in the case of a SETAR model with GARCH errors depending on past squared values of $\{Y_t\}$, a nonlinear AR–GARCH model, and a nonlinear AR model with (possibly nonlinear) GARCH errors.
For stability of (3.8) we need a tool which measures the geometric "drift" of the process when $\|\mathbf{Y}_{t-1}\|$ is large (and $C(\mathbf{Y}_{t-1}, \varepsilon_t)$ is negligible). To this end, we define the top Lyapunov exponent of the process $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$ as
$$\gamma = \liminf_{n\to\infty}\, \limsup_{\|y\|\to\infty}\, \frac{1}{n} E\Big[\log\frac{1 + \|\mathbf{Y}_n\|}{1 + \|\mathbf{Y}_0\|} \,\Big|\, \mathbf{Y}_0 = y\Big]. \tag{3.9}$$
Under some regularity conditions $\gamma < 0$ implies geometric ergodicity while, conversely, $\gamma > 0$ ensures that $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$ is transient (explosive); see Cline and Pu (1999a, 2001).
Evaluating the double limit in (3.9) by MC simulation is difficult. However, by establishing ergodicity for a process associated with $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$, one can express $\gamma$ in terms that are easier to compute. In particular, observe that only the first term on the right in (3.8) is homogeneous in $\mathbf{Y}_{t-1}$, and it dominates the behavior of $\mathbf{Y}_t$ when $\|\mathbf{Y}_{t-1}\|$ is very large. To exploit this characteristic, and following Cline (2007c), we consider the homogeneous version of (3.8). That is,
$$\mathbf{Y}_t^* = B\Big(\frac{\mathbf{Y}_{t-1}^*}{\|\mathbf{Y}_{t-1}^*\|}, \varepsilon_t\Big)\|\mathbf{Y}_{t-1}^*\|. \tag{3.10}$$
Let
$$w(\theta, u) = \|B(\theta, u)\|, \qquad \eta(\theta, u) = \frac{B(\theta, u)}{\|B(\theta, u)\|}, \qquad \text{for } \theta \in \Theta,\ u \in \mathbb{R}.$$
The homogeneous process can be collapsed to $\Theta$:
$$\theta_t^* = \frac{\mathbf{Y}_t^*}{\|\mathbf{Y}_t^*\|} = \eta(\theta_{t-1}^*, \varepsilon_t). \tag{3.11}$$
Also, let $W_t^* = w(\theta_{t-1}^*, \varepsilon_t)$.
Evidently the collapsed process $\{\theta_t^*\}$ is Markovian. More importantly, $\{\theta_t^*\}$ is uniformly ergodic (Cline, 2007c) with some stationary distribution, say $\pi$. Then the Lyapunov exponent for $\{\mathbf{Y}_t^*, t \in \mathbb{Z}\}$ is finite. Specifically,
$$\gamma = \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \log W_t^*, \quad \text{a.s.}$$
Thus, we can estimate $\gamma$ simply by simulating the collapsed process and obtaining the sample average of $\{\log W_t^*\}$. Alternatively, $\gamma$ may be determined numerically through an iterative procedure; see, e.g., Example 3.3.
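As an illustration of this device, consider a first-order SETAR model with ARCH(1) errors, $Y_t = aY_{t-1}I(Y_{t-1} \le 0) + bY_{t-1}I(Y_{t-1} > 0) + \varepsilon_t(\omega + \alpha Y_{t-1}^2)^{1/2}$, a model chosen here purely for illustration. For $|y|$ large, $B(\theta, u) = -aI(\theta < 0) + bI(\theta > 0) + u\sqrt{\alpha}$ with $\theta \in \{-1, 1\}$, so the collapsed recursion (3.11) reduces to tracking the sign of $B$. A minimal R sketch:

set.seed(56)
gamma_hat <- function(a, b, alpha, n = 200000) {
  eps   <- rnorm(n)
  theta <- 1                            # collapsed state on {-1, 1}
  logW  <- numeric(n)
  for (t in 1:n) {
    B <- ifelse(theta < 0, -a, b) + eps[t] * sqrt(alpha)
    logW[t] <- log(abs(B))              # W*_t = |B(theta*_{t-1}, eps_t)|
    theta   <- sign(B)                  # theta*_t = eta(theta*_{t-1}, eps_t)
  }
  mean(logW)                            # a.s. limit is gamma
}
gamma_hat(a = 0.5, b = -0.5, alpha = 0.3)   # negative: geometric ergodicity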
where the process $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} (0, 1)$, $|B(y/|y|, u)| \le b(1 + |u|)$ and $|C(y, u)| \le c(y)(1 + |u|)$. Setting
$$C(y, u) = A(y, u) - B(y/|y|, u)|y|,$$
we can decompose (3.14) in the form (3.13), where $B(\cdot)$ and $C(\cdot)$ are respectively a homogeneous and a locally bounded function in $Y_{t-1}$. Now, analogous to (3.11), the homogeneous form of (3.13) can be collapsed to the process $\{\theta_t^* = \eta(\theta_{t-1}^*, \varepsilon_t)\}$, which is a two-state Markov chain on $[-1, 1]$. Alternatively, we may take the state vector $\mathbf{Y}_t = (Y_t, Y_{t-1})'$, in which case the collapsed process $\{\theta_t^*\}$ takes values on the unit circle in $\mathbb{R}^2$. In addition, there are thresholds located at $\text{arc}(\theta) = \pm\pi/2$ on the unit circle. Since $m > 1$, one can only evaluate the Lyapunov exponent either by direct MC simulation or by numerically analyzing a uniformly ergodic process. Below we show results for $\gamma$ obtained by solving numerically an equilibrium equation.
Figure 3.2: Strict stationarity parameter regions (black solid lines) for a SETAR–ARCH model, parameter regions for checking the existence of the first moment (blue medium dashed lines) and second moment (red medium dashed lines), and parameter regions for second-order stationarity (green solid lines) of $\{\mathbf{Y}_t = (Y_t, Y_{t-1})', t \in \mathbb{Z}\}$.
Suppose $\gamma < 0$; then it is often useful to determine which moments are finite for the stationary distribution of $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$. For general nonlinear AR–GARCH processes it can be shown (Cline, 2007a) that the $r$th moment exists when there is a bounded, positive function $\lambda(\theta)$ such that
$$\sup_{\theta\in\Theta} E\Big[(W_t^*)^r\,\frac{\lambda(\theta_t^*)}{\lambda(\theta)} \,\Big|\, \theta_0^* = \theta\Big] < 1 \quad \text{for } r > 0. \tag{3.18}$$
Figures 3.2(a) and (b) show parameter regions for strict stationarity (black solid lines) of the SETAR–ARCH model in (3.16) with, in each case, six parameters fixed and the remaining two parameters varying over a range of values. The figures also contain parameter regions for checking the existence of the first and second moments of $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$. Obviously, both regions are contained within the strict stationarity region though covering a more restrictive set of parameter values. Indeed, we observe that for strict stationarity the leading coefficient $\phi_1^{(1)}$ can be quite negative provided the other leading coefficient is not too big. Note that the stability region in Figure 3.2(b) closely resembles the stability region of a SETAR(2; 1, 1) model given in Figure 3.3(a). Presumably the values of $\phi_1^{(1)}$ and $\phi_1^{(2)}$ dominate the general pattern of the stability region while the other parameters have hardly any effect.
Figures 3.2(a) and (b) also show the parameter regions for second-order stationarity (green solid lines). The corresponding condition follows from (3.7) in Section 3.2, and is given by
$$\big\{\max(|\phi_1^{(1)}|, |\phi_1^{(2)}|) + \max(|\phi_2^{(1)}|, |\phi_2^{(2)}|)\big\}^2 + \max\{\alpha_1^{(1)}, \alpha_1^{(2)}\} + \max\{\alpha_2^{(1)}, \alpha_2^{(2)}\} < 1. \tag{3.19}$$
We see that (3.19) is far too restrictive compared to the strict stationarity condition; imposing it would unduly limit the dynamics permitted by the SETAR–ARCH model. In fact, as we see from the shape of the region enclosed by the red medium dashed lines, some parameters may have values much bigger than one while the second moment is still finite.
$$\beta(k) = \frac{1}{2}\sup\sum_{i=1}^I\sum_{j=1}^J\big|P(A_i \cap B_j) - P(A_i)P(B_j)\big|, \tag{3.21}$$
where in the definition of $\beta(k)$ the supremum is taken over all pairs of finite partitions $\{A_1, \ldots, A_I\}$ and $\{B_1, \ldots, B_J\}$ of $\Omega$ such that $A_i \in \mathcal{F}_{-\infty}^0$ for each $i$ and $B_j \in \mathcal{F}_k^{\infty}$ for each $j$.
The quantities $\alpha(k)$ and $\beta(k)$ are called mixing coefficients. The process $\{Y_t, t \in \mathbb{Z}\}$ is called strongly mixing (or $\alpha$-mixing) if $\lim_{k\to\infty}\alpha(k) = 0$, and $\beta$-mixing (or absolutely regular) if $\lim_{k\to\infty}\beta(k) = 0$.
Assume π(E) = 1. If there exists a finite measure with property (3.22) and we run
a Markov chain with initial probability distribution π, then the resulting process is
stationary and its marginal distribution is π at any time point t.
It is of course not yet clear whether the distribution of $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$ converges towards an invariant distribution $\pi$. If such a convergence happens with respect to the total variation norm $\|\cdot\|_V$, and with a fixed geometric rate, the Markov chain $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$ is called geometrically ergodic. This means that there exists a constant $0 < \rho < 1$ such that $\forall y \in \mathbb{R}^m$,
$$\lim_{t\to\infty}\rho^{-t}\|P^t(y, \cdot) - \pi(\cdot)\|_V = 0 \tag{3.23}$$
for almost all initial states y ∈ Rm provided π(·) < ∞. Thus, a geometrically ergodic
stationary Markov chain is also strongly mixing with geometric rate. More precisely,
for α(k) as defined by (3.20), we have α(k) ≤ Kρk for some constants K > 0 and
ρ ∈ (0, 1). If (3.23) holds when ρ = 1, then {Yt , t ∈ Z} is said to be Harris ergodic.
As usual in the theory of Markov chains, we restrict attention to the case of irreducible Markov chains. Let $\varphi$ be a non-trivial (i.e. $\varphi(\mathbb{R}^m) > 0$) $\sigma$-finite measure on $(\mathbb{R}^m, \mathcal{E})$. Then the Markov process defined above is called $\varphi$-irreducible if $\forall C \in \mathcal{E}$ with $\varphi(C) > 0$, $\forall y \in \mathbb{R}^m$,
$$\sum_{t=1}^{\infty}P^t(y, C) > 0.$$
This simply states that almost all parts of the state space are accessible from all points $y$ of $\mathbb{R}^m$. Further, a Markov chain is a (weak) Feller chain if for every bounded continuous function $g(\cdot)$ on $E = \mathbb{R}$ the function $y \mapsto E\{g(Y_t) \mid Y_{t-1} = y\}$ is also continuous in $y \in E$.
Next, we state a result due to Feigin and Tweedie (1985, Thm. 1) which ensures geometric ergodicity. Suppose that (i) $\{\mathbf{Y}_t, t \in \mathbb{Z}\}$ is a Feller chain, and there exist a measure $\varphi$ and a compact set $C$ with $\varphi(C) > 0$ such that
(ii) $\{\mathbf{Y}_t, t \ge 0\}$ is $\varphi$-irreducible;
(iii) there exists a non-negative continuous function $V: E \to \mathbb{R}$ satisfying $V(y) \ge 1$ $\forall y \in C$ and, for some $\delta > 0$, the drift condition
$$E\{V(\mathbf{Y}_t) \mid \mathbf{Y}_{t-1} = y\} \le (1 - \delta)V(y), \quad \forall y \notin C.$$
Figure 3.3: Stationarity region of a SETAR(2; 1, 1) model; (a) d = 1, and (b) general d.
(i) Lebesgue’s dominated convergence theorem ensures that for any bounded
continuous function V (·), E{V (Yt )|Yt−1 = y} is continuous in y, and
hence the Markov chain is Feller.
(ii) Given Y0 = y, the law of Y1 = A1 y + B1 admits a strictly positive
density with respect to Lebesgue measure μLeb , and so the chain is φ-
irreducible with φ = μLeb .1
(iii) The condition $E\|A_1\|^{\epsilon} < 1$ for some $\epsilon > 0$ implies $E(\log\|A_1\|) < 0$, using Jensen's inequality. Now, without loss of generality, let $\epsilon \in (0, 1]$ and $V(y) = 1 + \|y\|^{\epsilon}$, $y \in \mathbb{R}^m$. Obviously, the drift condition then holds for some constant $1 - \delta > E\|A_1\|^{\epsilon}$, which completes the argument.
Thus, the stationary solution (3.5) of the SRE is geometrically ergodic, and
hence strongly mixing with geometric rate.
$^1$ Lebesgue measure $\mu_{Leb}$ is a unique positive measure on the class $\mathcal{R}$ of linear Borel sets. It is specified by the requirement $\mu_{Leb}(a, b] = b - a$ $\forall a, b \in \mathbb{R}$ ($a \le b$). Lebesgue measure on the class $\mathcal{R}^m$ of $m$-dimensional Borel sets is constructed similarly using the area of bounded rectangles as a basic definition; see, e.g., Billingsley (1995, Chapter 2).
Figure 3.3(a) shows the geometric ergodicity (strict stationarity) region for
SETAR(2; 1, 1) models with d = 1; see Table 3.1. Note that in contrast with
the stationarity of linear AR models, the region is unbounded. Moreover, we
see a much larger region of stationarity than the region |φ1 | < 1 and |φ2 | < 1
which would result if only sufficient conditions for stationarity were applied.
Figure 3.3(b) shows the stationarity region in the parameter space implied by
SETAR(2; 1, 1) models with d ≥ 2. Comparing these two plots, we see clearly
the effect of the delay parameter d.
In Markov chain terminology, it can be proved (Guo and Petruccelli, 1991)
that the SETAR(2; 1, 1) model with d ≥ 1 is positive Harris recurrent in the
blue-striped “interior” and “boundary” areas; and it is transient (explosive)
in the “exterior” of the parameter space. The SETAR(2; 1, 1) model is null
recurrent on the boundaries, and regular in the strict interior parameter space
which in this case implies that the process {Yt , t ∈ Z} is geometrically ergodic.
In other words, the limit cycle behavior of the SETAR model arises from the
alternation of explosive, dormant, and rising regimes.
Table 3.1 gives an overview of necessary and sufficient conditions for geometric ergodicity of some threshold models. The proofs are given under the assumption that $\{\varepsilon_t\}$ is i.i.d. with positive pdf over the real line $\mathbb{R}$ and $E|\varepsilon_t| < \infty$. If appropriate, it is also assumed that for each $i$ the $\{\varepsilon_t^{(i)}\}$ are i.i.d. and that $\{\varepsilon_t^{(i)}, i = 1, \ldots, k\}$ are independent. Finally, note that for the general SETARMA model $d \le p$, since if $d > p$ one can introduce additional coefficients $\phi_j^{(i)} = 0$ for $j > p$.
Observe that for the SETARMA model, stationarity is completely determined by the linear AR pieces defined on the two boundary threshold regimes. That is, the MA part of the model does not affect stationarity. In fact, a pure SETMA model is always stationary and ergodic, as is the linear MA model. Another interesting feature of SETARMA models is that overall (global) stationarity does not require the model to be stationary in each regime. The ergodicity conditions given by Liu and Susko (1992) and Lee and Shin (2001) illustrate this remark; see also Exercise 2.2 and the sketch below. In general, distinguishing between local and global stationarity and between local and global invertibility (see Section 3.5) is important for physical motivation and for application of nonlinear time series models. However, it is quite complicated to derive explicit (analytical) conditions for local stationarity and local invertibility.
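The first row of Table 3.1 makes the point concrete; a minimal check in R (function name our own):

pw_ergodic <- function(phi1, phi2) (phi1 < 1) && (phi2 < 1) && (phi1 * phi2 < 1)
pw_ergodic(0.5, -3)    # TRUE: ergodic, although regime 2 alone is "explosive"
pw_ergodic(1, 1)       # FALSE: boundary of the ergodic region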
Table 3.1: Necessary and sufficient conditions for geometric ergodicity of SETAR(MA) models.

Petruccelli and Woolford (1984). SETAR(2; 1, 1): $Y_t = \phi_1 Y_{t-1}I(Y_{t-1} \le 0) + \phi_2 Y_{t-1}I(Y_{t-1} > 0) + \varepsilon_t$. Conditions: $\phi_1 < 1$, $\phi_2 < 1$, $\phi_1\phi_2 < 1$ (necessary and sufficient).

Chan et al. (1985). SETAR(k; 1, ..., 1): $Y_t = \sum_{i=1}^k\{\phi_0^{(i)} + \phi_1^{(i)}Y_{t-1} + \varepsilon_t^{(i)}\}I(Y_{t-1} \in R^{(i)})$. Conditions: $\phi_1^{(1)} < 1$, $\phi_1^{(k)} < 1$, and $\phi_1^{(1)}\phi_1^{(k)} < 1$ (sufficient).

Chen and Tsay (1991) (1). SETAR(2; 1, 1): $Y_t = \phi_1 Y_{t-1}I(Y_{t-d} \le 0) + \phi_2 Y_{t-1}I(Y_{t-d} > 0) + \varepsilon_t$ ($d \ge 2$). Conditions: $\phi_1 < 1$, $\phi_1\phi_2 < 1$, $\phi_1^{t_d}\phi_2^{s_d} < 1$, $\phi_1^{s_d}\phi_2^{t_d} < 1$, where $t_d, s_d \in \mathbb{N}$, $t_d = s_d + 1$, and $s_2 = 1$, $s_3 = 3$, $s_4 = 7$, $s_5 = 1$, $s_6 = 31$, $s_7 = 63$, $s_8 = 1$, $s_9 = 33$, $s_{10} = 3$ (necessary and sufficient).

Brockwell et al. (1992) (2). SETAR(k; p, ..., p)–MA(q): $Y_t = \sum_{i=1}^k\{\phi_0^{(i)} + \sum_{j=1}^p\phi_j^{(i)}Y_{t-j} + \varepsilon_t + \sum_{j=1}^q\psi_j\varepsilon_{t-j}\}I(Y_{t-d} \in R^{(i)})$. Condition: $\rho(\max_i\{|A^{(i)}|\}) < 1$ ($i = 1, \ldots, k$), with $A^{(i)} = \begin{pmatrix}\phi_1^{(i)} & \cdots & \phi_p^{(i)}\\ I_{p-1} & 0_{(p-1)\times 1}\end{pmatrix}$ (sufficient).

Liu and Susko (1992). $Y_t = \sum_{i=1}^k\{\phi_0^{(i)} + \phi_1^{(i)}Y_{t-d} + \varepsilon_t + \sum_{j=1}^q\psi_j^{(i)}\varepsilon_{t-j}\}I(Y_{t-d} \in R^{(i)})$. Conditions: $\phi_1^{(1)} < 1$, $\phi_1^{(k)} < 1$, $\phi_1^{(1)}\phi_1^{(k)} < 1$ (sufficient); $\phi_1^{(1)} \le 1$ and $\phi_1^{(k)} \le 1$ (necessary).

Amendola et al. (2009a). SETARMA(2; p, q, p, q): $Y_t = \sum_{i=1}^2\{\phi_0^{(i)} + \sum_{j=1}^p\phi_j^{(i)}Y_{t-j} + \varepsilon_t + \sum_{j=1}^q\psi_j^{(i)}\varepsilon_{t-j}\}I(Y_{t-d} \in R^{(i)})$. Condition: $\max_i\{\rho(A^{(i)})\} < 1$ ($i = 1, 2$); sufficient, but weaker than $\rho(\max_i\{|A^{(i)}|\}) < 1$.

Niglio and Vitale (2010a). SETARMA(k; 1, q, ..., 1, q): $Y_t = \sum_{i=1}^k\{\phi_1^{(i)}Y_{t-d} + \varepsilon_t + \sum_{j=1}^q\psi_j^{(i)}\varepsilon_{t-j}\}I(Y_{t-d} \in R^{(i)})$. Condition: $\prod_{i=1}^k|\phi_1^{(i)}|^{p_i} < 1$ ($i = 1, \ldots, k$), where $p_i = E[I(Y_{t-d} \in R^{(i)})]$, with $0 < p_i < 1$ (3) and $\sum_{i=1}^k p_i = 1$ (sufficient).

Lee and Shin (2000). MTAR(2; 1, 1): $Y_t = \phi_1 Y_{t-1}I(Y_{t-1} \ge Y_{t-2}) + \phi_2 Y_{t-1}I(Y_{t-1} < Y_{t-2}) + \varepsilon_t$. Conditions: $\phi_1 < 1$, $\phi_2 < 1$, $\phi_1\phi_2 < 1$, $\phi_1\phi_2^2 < 1$, and $\phi_1^2\phi_2 < 1$ (sufficient).

Lee and Shin (2001). MTAR(2; 1, 1) with partial unit roots. Conditions: $\phi_1 = 1$, $|\phi_2| < 1$ or $|\phi_1| < 1$, $\phi_2 = 1$ (necessary and sufficient).

(1) Lim (1992) derives necessary and sufficient conditions for stability of the deterministic SETAR(2; 1, 1) model with general $d$.
(2) Ling (1999) shows that a sufficient condition for strict stationarity of the SETARMA(k; p, q, ..., p, q) model is given by $\sum_{j=1}^p\max_i|\phi_j^{(i)}| < 1$ ($i = 1, \ldots, k$), which is equivalent to the condition given by Brockwell et al. (1992).
(3) The $k$-regime SETARMA model becomes a linear ARMA model when $p_i = 1$, and a $k^*$-regime SETAR model ($k^* < k$) when $p_i = 0$.
3.5 Invertibility
The classical invertibility concept for univariate linear time series processes loosely
says that a time series process is invertible when we are able to express the noise
process {εt } as a convergent series of the observations {Yt }, given that the DGP is
completely known. From the theory of linear time series it is well known that the
invertibility concept is pivotal when one tries to recover the innovations from the
observations of a DGP. Indeed, invertibility assures that there is a unique representation of the model which can be used for forecasting. In this section, we discuss
conditions for the global and local invertibility of nonlinear DGPs, where in the
latter case the boundary region is a part of the possible parameter space.
3.5.1 Global

To begin with, suppose $\{Y_t, t \in \mathbb{Z}\}$ is generated by the stationary and ergodic NLARMA(p, q) model (3.25), where $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} (0, \sigma_\varepsilon^2)$, and $g(\cdot; \theta)$ is a known real-valued function for a known parameter vector $\theta$. For nonlinear time series there exist (at least) three concepts of invertibility, each formulated in terms of the reconstruction error
$$e_t = \varepsilon_t - \hat\varepsilon_t. \tag{3.27}$$
One such concept requires that
$$E[e_t^2] \to 0 \quad \text{as } t \to \infty, \tag{3.28}$$
while another requires only that
$$E|e_t|^r \to c \quad \text{as } t \to \infty, \tag{3.30}$$
for some finite constant $c$.
Table 3.2: Necessary and sufficient conditions for invertibility of NLMA-type models (1).

Ling and Tong (2005). SETMA(2; p, q): $Y_t = \sum_{i=1}^p\phi_i\varepsilon_{t-i} + \sum_{i=1}^q\psi_iI(Y_{t-d} \le r)\varepsilon_{t-i} + \varepsilon_t$. Condition: $\sum_{i=1}^p|\phi_i| < 1$ and $\sum_{i=1}^p|\phi_i + \psi_i| < 1$, where $\psi_i = 0$ for $i > q$ (sufficient).

Ling et al. (2007). SETMA(k; 1, ..., 1): $Y_t = \{\psi_0 + \sum_{i=1}^k\psi_iI(r_{i-1} < Y_{t-1} \le r_i)\}\varepsilon_{t-1} + \varepsilon_t$. Condition: $\prod_{i=1}^k\{|\psi_0 + \psi_i|^{F_Y(r_i) - F_Y(r_{i-1})}\} < 1$ (2); not invertible if $\prod_{i=1}^k\{|\psi_0 + \psi_i|^{F_Y(r_i) - F_Y(r_{i-1})}\} > 1$.

$$E[e_t^2] \to 0 \quad \text{as } t \to \infty, \tag{3.31}$$
Hallin (1980) shows that in nonlinear models with constant coefficients defini-
tions (i) and (ii) are equivalent. When the coefficients are not time dependent
and the DGP is linear, (3.32) coincides with the classical invertibility condi-
tion.
Table 3.3: Necessary and sufficient conditions for stationarity and invertibility of BL models. In all cases $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} (0, \sigma_\varepsilon^2)$ unless otherwise specified.

… with $s = \max(p, Q)$ (sufficient).

Liu (1990). $Y_t = \sum_{i=1}^p\phi_iY_{t-i} + \varepsilon_t + \theta\varepsilon_{t-1} + \sum_{u=1}^p\sum_{v=1}^Q\psi_{uv}Y_{t-u}\varepsilon_{t-v}$, with $E\{\log^+|\varepsilon_1|\} < \infty$. Condition: $E\{\log\|\prod_{j=1}^pB(t-j)\|\} < 0$ with
$$B(t) = \begin{pmatrix}\phi_1 + \sum_{v=1}^Q\psi_{1v}\varepsilon_{t-v} & \cdots & \phi_p + \sum_{v=1}^Q\psi_{pv}\varepsilon_{t-v}\\ I_{p-1} & 0_{(p-1)\times 1}\end{pmatrix}$$
(sufficient).

Marek (2005). $Y_t = \varepsilon_t + (a + \beta Y_{t-2})\varepsilon_{t-1}$: condition $\beta^2\sigma_\varepsilon^2 < (1 - a^2)/2$. $Y_t = (a + \beta\varepsilon_{t-1})\varepsilon_t + \alpha\varepsilon_t$ ($a \neq 0$, $\alpha \neq 0$, $\beta > 0$), $|\varepsilon_t| < 1$: condition $|\alpha| < |a|$ and $\beta < (|a| - |\alpha|)/3$ (sufficient).

(1) The condition reduces to the sufficient condition of Subba Rao (1981) for a BL(p, 0, p, 1) model. In the case $p = Q = 1$ the condition becomes $|\psi| < \exp(-E\log|Y_t|)$, earlier obtained by Pham and Tran (1981).
Figure 3.4: Invertibility regions of the RCMA(1) model with At,1 following respectively a
U (a − θ, a + θ) distribution (blue solid curve), a N (a, θ 2 ) distribution (red solid curve), and
a Student t6 (a, θ) distribution (green solid curve).
play a role to ensure invertibility of the SETMA model. For the SETMA model there
is no difficulty in extending the results to the case where the data are generated by
a SETARMA model.
$$E(\log|A_{t,1}|) = \log\theta + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\Big(\frac{-y^2}{2}\Big)\log\Big|y + \frac{a}{\theta}\Big|\,dy. \tag{3.35}$$
Figure 3.4 shows the parameter regions for both sequences $\{A_{t,1}\}$ using the invertibility condition $E(\log|A_{t,1}|) < 0$. Note that in the case of (3.34) the blue solid curve passes through the point $(a, \theta) = (0, e)$, while in the case of (3.35) the red solid curve goes through the point $(0, 1.8874\cdots)$. Figure 3.4 also includes the parameter region for invertibility of the RCMA(1) model when $\{A_{t,1}\} \overset{\text{i.i.d.}}{\sim} t_6(a, \theta)$ (green solid curve).
(ii) Replace $\varepsilon_t$ by $\hat\varepsilon_t$ for $t = T+1, \ldots, N$ and use past values $Y_{t-k}$ ($k = 0, \ldots, p$) and $\hat\varepsilon_{t-k}$ ($k = 0, \ldots, q$) to generate a new set of observations $\{\hat{Y}_t\}_{t=T+1}^N$.
(iii) Calculate $\{\hat{e}_t = Y_t - \hat{Y}_t\}_{t=T+1}^N$, where $\hat{Y}_t$ are the out-of-sample fitted values. Estimate $E(e_t^2)$ by $(\tau - T)^{-1}\sum_{t=T+1}^{\tau}\hat{e}_t^2$. If for all values of $\tau = T+1, \ldots, N$ this sequence does not exceed a pre-fixed value, the process $\{Y_t, t \in \mathbb{Z}\}$ is said to be empirically invertible; otherwise the result suggests non-invertibility.
Consider, as an illustration of Algorithm 3.1, an NLMA model of the form
$$Y_t = \varepsilon_t + \beta\varepsilon_{t-1} + \psi F(\varepsilon_{t-1})\varepsilon_{t-1}, \qquad \{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1), \tag{3.36}$$
where $F(\varepsilon_{t-1}) = [1 + \exp(-\gamma\varepsilon_{t-1})]^{-1}$, and $\gamma > 0$. No explicit invertibility conditions have yet been derived for this model.
For $T = 100$, we generated 1,000 time series $\{Y_t\}_{t=1}^{100}$, in each case dropping the first 150 observations to avoid start-up effects. Using Algorithm 3.1 with $N = 1{,}000$, we computed a sequence of estimates of $E(e_t^2)$. Next, the process was classified as empirically invertible if for all values $\tau = T+1, \ldots, N$ the values of the sequence did not exceed $10^{10}$.
Figures 3.5(a) and (b) show curves of the proportion of non-invertible models
as a function of the parameter ψ for three different values of γ. Note that
the empirical invertibility region remains the same as γ increases when β = 0,
while the region reduces when β = 0.8. For γ = 0.5 the width of the empirical
region is about the same in both figures. For larger values of γ the size of
the invertibility region becomes smaller when β = 0.8. Moreover, the curves
show a clear difference in the proportion of non-invertible models for ψ > 0 as
opposed to ψ < −2.
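A minimal R sketch of this experiment, written in terms of the reconstruction errors $e_t = \varepsilon_t - \hat\varepsilon_t$ of (3.27) and using a deliberately wrong start value $\hat\varepsilon_T = 0$; the threshold and all names are our own implementation choices.

set.seed(78)
empir_invert <- function(beta, psi, gamma, T = 100, N = 1000, tol = 1e10) {
  Fl  <- function(e) 1 / (1 + exp(-gamma * e))       # logistic F(.) of (3.36)
  eps <- rnorm(N + 1)                                # eps[t+1] stores eps_t, t = 0..N
  Y   <- eps[-1] + (beta + psi * Fl(head(eps, N))) * head(eps, N)
  ehat <- eps                                        # residuals known up to time T
  ehat[T + 1] <- 0                                   # mis-initialize eps-hat_T
  for (t in (T + 1):N)                               # invert (3.36) recursively
    ehat[t + 1] <- Y[t] - (beta + psi * Fl(ehat[t])) * ehat[t]
  e2 <- (eps - ehat)[(T + 2):(N + 1)]^2
  all(cumsum(e2) / seq_along(e2) < tol)              # TRUE: empirically invertible
}
mean(replicate(200, empir_invert(beta = 0, psi = 0.8, gamma = 2)))  # prop. invertible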
Throughout the previous part, we assumed that (3.25) is an ergodic strictly stationary process. Within a Markov chain framework this requires verifying the irreducibility condition as a part of the Feigin–Tweedie result to establish geometric ergodicity. For general nonlinear MA models this is a non-trivial problem. Interestingly, Li (2012) derives an explicit/closed form of the unique strictly stationary and ergodic solution to the multiple-regime SETMA model without resorting to Markov chain theory. Using a different approach, his work generalizes results of Li, Ling, and Tong (2012) for two-regime SETMA models. The main idea is to re-formulate the model as an SRE and adopt the notion of the top Lyapunov exponent, as discussed in Section 3.1.
Consider a k-regime SETMA model of order q which we write in the form
$$Y_t = a_t^{(k)} + \sum_{i=1}^{k-1}(a_t^{(i)} - a_t^{(k)})I(Y_{t-d} \in R^{(i)}), \tag{3.37}$$
where
$$a_t^{(i)} = \psi_0^{(i)} + \varepsilon_t + \sum_{j=1}^q\psi_j^{(i)}\varepsilon_{t-j}, \quad (i = 1, \ldots, k).$$
Here, $\{\varepsilon_t\}$ is assumed to be a strictly stationary and ergodic process rather than the usual and more restrictive assumption that $\{\varepsilon_t\}$ is i.i.d. It follows from (3.37) that
$$I(Y_t \in R^{(i)}) = I(a_t^{(k)} \in R^{(i)}) + \sum_{j=1}^{k-1}\big\{I(a_t^{(j)} \in R^{(i)}) - I(a_t^{(k)} \in R^{(i)})\big\}I(Y_{t-d} \in R^{(j)}), \quad (i = 1, \ldots, k-1). \tag{3.38}$$
Define $\mathbf{I}_t = (I(Y_t \in R^{(1)}), \ldots, I(Y_t \in R^{(k-1)}))'$ and $\mathbf{a}_t = (I(a_t^{(k)} \in R^{(1)}), \ldots, I(a_t^{(k)} \in R^{(k-1)}))'$, and let
$$A_t = (a_{ij,t}) \quad \text{with} \quad a_{ij,t} = I(a_t^{(j)} \in R^{(i)}) - I(a_t^{(k)} \in R^{(i)}) \quad (i, j = 1, \ldots, k-1).$$
Then
$$\mathbf{I}_t = A_t\mathbf{I}_{t-d} + \mathbf{a}_t, \tag{3.39}$$
which is of the form (3.5). So, a unique strictly stationary and ergodic solution of $\{Y_t, t \in \mathbb{Z}\}$ is given by
$$Y_t = a_t^{(k)} + (a_t^{(1)} - a_t^{(k)}, \ldots, a_t^{(k-1)} - a_t^{(k)})\mathbf{I}_{t-d}, \quad \text{a.s.}, \tag{3.41}$$
where $\mathbf{I}_{t-d} = \sum_{s=1}^{\infty}\big(\prod_{i=1}^{s-1}A_{t-id}\big)\mathbf{a}_{t-sd}$. It is immediate that (3.41) does not require any restriction on the coefficients of the process, which is different from SETAR models.
3.5.2 Local

Within the setting of a nonlinear stochastic difference equation, it is possible (Chan and Tong, 2010) to link local invertibility with the stability (in a suitable sense) of an attractor in a dynamical system. Let $\mathbf{e}_t = (e_t, \ldots, e_{t-q+1})'$ be the vector of reconstruction errors, and $\boldsymbol{\varepsilon}_t = (\varepsilon_t, \ldots, \varepsilon_{t-q+1})'$ ($q > 1$). Then (3.25) can be rewritten as a homogeneous (deterministic) equation associated with the SRE (3.1) in which $B_t$ is replaced by the zero vector, i.e.
$$\mathbf{e}_t = F(\mathbf{e}_{t-1}, \boldsymbol{\varepsilon}_{t-1}; \theta) = \big(g(\varepsilon_{t-1}, \ldots, \varepsilon_{t-q}; \theta) - g(e_{t-1} + \varepsilon_{t-1}, \ldots, e_{t-q} + \varepsilon_{t-q}; \theta),\ e_{t-1}, \ldots, e_{t-q+1}\big)'. \tag{3.42}$$
$$\mathbf{e}_t = \mathbf{0} + \Big(\prod_{s=1}^t\dot{F}_s\Big)\mathbf{e}_0, \tag{3.43}$$
$$\lim_{t\uparrow\infty}\frac{1}{t}\log\Big\|\prod_{s=1}^t\dot{F}_s\Big\| = \gamma(\dot{F}). \tag{3.44}$$
When $q = 1$, $\gamma(\dot{F}) = E(\log\|\dot{F}_1\|)$, by the independence of the $\dot{F}_s$'s. For $q > 1$ a sufficient local invertibility condition can be obtained using the following property of a matrix norm: $\|\prod_sA_s\| \le \prod_s\|A_s\|$ for a sequence of regular matrices $A_s$ in $\mathbb{R}^{q\times q}$. Then, assuming that $\{\dot{F}_s\}$ is a function of a stationary and ergodic process, we have
$$t^{-1}E\Big[\log\Big\|\prod_{s=1}^t\dot{F}_s\Big\|\Big] \le t^{-1}p\sum_{j=1}^m E\big(\log\|\dot{F}_j\|\big) + t^{-1}E_r\big(\log\|\dot{F}_1\|\big),$$
where $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} (0, \sigma_\varepsilon^2)$. From (3.41), we know that $\{Y_t, t \in \mathbb{Z}\}$ is strictly stationary and ergodic.

Figure 3.6: Plot of a strictly stationary and ergodic time series generated by a globally invertible, but locally non-invertible SETMA(2; 2, 2) model; $T = 5{,}000$.
Ling et al. (2007) show that for the SETMA(k; 1, ..., 1) model $Y_t = \{\psi_0 + \sum_{i=1}^k\psi_iI(r_{i-1} < Y_{t-1} \le r_i)\}\varepsilon_{t-1} + \varepsilon_t$ the spectral radius $\rho(\dot{F})$ is given by
$$\rho(\dot{F}) = \exp\gamma(\dot{F}) = \prod_{i=1}^k\big\{|\psi_0 + \psi_i|^{F_Y(r_i) - F_Y(r_{i-1})}\big\}.$$
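In practice $F_Y(\cdot)$ is unknown, but $\rho(\dot{F})$ can be approximated by simulation. The sketch below does this for a SETMA(2; 1, 1) model with $\psi_2 = 0$; all parameter values are illustrative.

set.seed(90)
psi0 <- 0.4; psi1 <- 0.7; r1 <- 0
n   <- 100000
eps <- rnorm(n); Y <- numeric(n)
for (t in 2:n)                                   # simulate the SETMA(2;1,1) model
  Y[t] <- (psi0 + psi1 * (Y[t - 1] <= r1)) * eps[t - 1] + eps[t]
p1  <- mean(Y <= r1)                             # empirical estimate of F_Y(r1)
rho <- abs(psi0 + psi1)^p1 * abs(psi0)^(1 - p1)  # estimated rho(F-dot)
rho                                              # < 1: invertible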
the regime switching depends directly on a hidden Markov chain and only indirectly on the
current state of the process itself, i.e. the process {(At , Bt ), t ∈ Z} in (3.1) is no longer i.i.d.
Section 3.3: Goldsheid (1991) provides a CLT which may be used to construct asymptotic confidence bands for estimators of the top Lyapunov exponent, while Gharavi and Anantharam (2005) derive an upper bound for $\gamma(\cdot)$. In a review paper, Lindner (2009) addresses the question of strictly stationary and weakly stationary solutions for pure GARCH processes.
Section 3.4: In the early 1980s most of the literature considered sufficient, and rarely necessary, conditions for stationarity and ergodicity for nonlinearities in the conditional mean; see, e.g., Chan and Tong (1985), Liu (1989a, 1995), Pham (1986), Pham and Tran (1985), Liu and Brockwell (1988) and the references therein. During the last two decades the focus has mainly been on conditions for combined models with nonlinearities in both the conditional mean and the conditional variance; see, e.g., Fonseca (2004) and Chen et al. (2011b) for references to the main contributions. More recent developments are by Chen and Chen (2000), Ferrante et al. (2003), Fonseca (2005), Liebscher (2005) and Meitz and Saikkonen (2008, 2010), among others.
Section 3.4.2: Meyn and Tweedie (1993, Appendix B) propose a four-step procedure to classify a SETAR model as being ergodic, transient, or null recurrent. This procedure may also serve as a template for analyzing other nonlinear time series models.
Section 3.5: In the case when (3.25) has time dependent coefficients, Hallin (1980) gener-
alizes the notion of invertibility in (3.31). Using the solution to the SETMA process (3.41),
Li (2012) and Li, Ling, and Tong (2012) derive explicit expressions for the moments and
ACF of some special TMA models. Amendola et al. (2006a, 2007) give examples of moment
and ACF expressions of SETARMA models. Chen and Wang (2011) investigate some prob-
abilistic properties of a combined linear–nonlinear ARMA model with time dependent MA
coefficients.
Appendix

3.A Vector and Matrix Norms

The function $x \mapsto \|x\|_p$ is the $L^p$-norm; $\mathbb{R}^n$ equipped with this norm is known as the $L^p$-normed linear space. The most common such spaces are $L^1$ and $L^2$, corresponding to the one-norm ($p = 1$) and the two-norm ($p = 2$), respectively.

(ii) The infinity-norm:
Let $x = (x_1, \ldots, x_n)'$ be a vector in $\mathbb{R}^n$. Another standard norm is the infinity, or maximum, or supremum, norm given by the function
$$\|x\|_\infty = \max_{1\le i\le n}(|x_i|). \tag{A.5}$$
The vector space $\mathbb{R}^n$ equipped with the infinity norm is commonly denoted $L^\infty$.
(iii) Continuous linear functionals:
Let $V = C[a, b]$ be the space of all continuous functions $f(\cdot)$ on the finite interval $[a, b]$. Then a natural norm is
$$\|f\|_p = \Big(\int_a^b|f(x)|^p\,dx\Big)^{1/p}, \quad p \ge 1, \tag{A.6}$$
Matrix norms:
Suppose $\{\mathbb{R}^n, \|x\|_p\}$ is a normed linear space with $\|x\|_p$ some norm. Let $A = (a_{ij})_{m\times n}$ be a real matrix. Then the norm of $A$, subordinate to the vector norm $\|x\|_p$, is defined as
$$\|A\|_p = \sup_{x\neq 0}\frac{\|Ax\|_p}{\|x\|_p} = \sup_{\|x\|_p = 1}\|Ax\|_p, \quad x \in \mathbb{R}^n,\ Ax \in \mathbb{R}^m. \tag{A.7}$$
So, $\|A\|_p$ is the largest value of the vector norm of $Ax$ in the space $V = \mathbb{R}^n$ normalized over all non-zero vectors $x$. In particular,
$$\|A\|_1 = \max_j\sum_i|a_{ij}|, \qquad \|A\|_2 = \big(\text{maximum eigenvalue of } A'A\big)^{1/2}.$$
The norm $\|A\|_2$ is often called the spectral norm. When $p = 1$ and 2, the matrix norm satisfies the four properties (A.8) – (A.11). Here, (A.8) – (A.10) are generalizations of the three properties (A.1) – (A.3). Property (A.11) is a direct consequence of the definition (A.4). A special case of (A.11) is
$$\|AB\|_p \le \|A\|_p\|B\|_p, \tag{A.12}$$
which is a simple but often useful property. Another special case of (A.11) is (A.13). If some or all of the matrices $D_v = I_n$, as with the so-called companion matrix, then by (A.13), $\|B_j\| \ge 1$. So, the condition leading to (A.13) is not fulfilled. One can get around this problem by multiplying together sufficiently many $B_j$'s before taking the norm.
3.B Spectral Radius

$$\rho(A) \le \|A\|_p, \tag{B.2}$$
for all subordinate matrix norms. This property can be easily proved. Note that $\rho(A)$ is not a norm since the triangle inequality $\rho(A + B) \le \rho(A) + \rho(B)$ can fail.
The following properties are often useful. For any positive integer $m$ and every $\epsilon > 0$ there is a constant $c > 0$ such that
$$|(A^m)_{ij}| \le c\big(\rho(A) + \epsilon\big)^m, \quad \forall i, j, \tag{B.3}$$
$$\rho(A) \le \max_{1\le i\le n}\sum_{j=1}^n|a_{ij}| \le n\max_{1\le i,j\le n}|a_{ij}|. \tag{B.4}$$
Exercises

Theory Questions

3.1 Consider the ExpAR model
$$Y_t = \{\phi + \xi\exp(-\gamma Y_{t-1}^2)\}Y_{t-1} + \varepsilon_t, \quad (|\phi| < 1,\ \gamma > 0),$$
where $\{\varepsilon_t\}$ are i.i.d. random variables, each having a strictly positive and continuous density $f(x) = (1/2)\exp(-|x|)$. Prove that $\{Y_t, t \in \mathbb{Z}\}$ is geometrically ergodic and $E|Y_t^m| < \infty$ $\forall m \in \mathbb{Z}^+$.
3.2 Consider the NLMA model
$$Y_t = \varepsilon_t + \psi(\varepsilon_{t-1})\varepsilon_{t-1},$$
where $\psi(\varepsilon) = \sum_{i=1}^k\beta^{(i)}F_{R^{(i)}}(\varepsilon)$ with $F_{R^{(i)}}(\cdot)$ the characteristic function of the set $R^{(i)}$ ($i = 1, \ldots, k$). Assume $|\beta^{(i)}| \le \gamma < 1$ and $E|\varepsilon_t|^m \le c < \infty$ ($m \in \mathbb{Z}^+$), where $\gamma$ and $c$ are real positive constants. Furthermore, assume that the residual $\hat\varepsilon_0 = 0$. Show that the process $\{Y_t, t \in \mathbb{Z}\}$ is invertible in the sense that $\limsup_{t\to\infty}E|e_t|^m \le c^*$, where $\{e_t\}$ are the reconstruction errors, and $c^* < \infty$ is some constant.
3.3 … where $\beta \neq 0$. Granger and Andersen (1978a, p. 28) claim that this model is never invertible for any non-zero value of the parameter $\beta$.
(a) Show that under the condition $|\beta| < (C + \log 2)/4$ the model is locally invertible, where $C$ is Euler's constant.
(b) Consider Algorithm 3.1 with $N = 1{,}000$. Set $T = 50$ and $T = 100$. Then, using 1,000 MC replications, show that the model is empirically invertible for $|\beta|$ values smaller than approximately 0.85.
3.4 Using the above model, Terdik (1999, p. 207) obtains the following estimation results for the magnetic field data (Example 1.3):
(a) Verify that the fitted BL model is a weakly (second-order) stationary process, assuming it is first-order stationary.
(b) Show that (3.46) is invertible if $\phi$ and $\psi$ satisfy the condition
3.5 … $Y_t = Z_{1,t-1} + \theta_0\varepsilon_t$. (Kristensen, 2009)

3.6 Consider the asMA(1) model
$$Y_t = \varepsilon_t + \beta^+\varepsilon_{t-1}I(\varepsilon_{t-1} \ge 0) + \beta^-\varepsilon_{t-1}I(\varepsilon_{t-1} < 0).$$
(a) Using Algorithm 3.1 with $N = 1{,}000$, obtain a graphical representation of the empirical invertibility region for a simulated time series of size $T = 100$, using 1,000 MC replications.
(b) Wecker (1981) derives the following sufficient invertibility conditions: $|\beta^+| < 1$ and $|\beta^-| < 1$. Compare and contrast the resulting invertibility region with the one obtained in part (a). Suggest a necessary and sufficient condition for invertibility.
3.7 (a) Consider the asMA(1) model in Exercise 3.6. Rewrite the model in the form $Y_t = \varepsilon_t + \beta(\varepsilon_{t-1})$, where $\beta(\varepsilon_{t-1}) = \sum_{i=1}^2\beta_iI(\varepsilon_{t-1} \in S_i)\varepsilon_{t-1}$ with $\beta_1 = \beta^+$, $\beta_2 = \beta^-$, $S_1 = [0, \infty)$ and $S_2 = (-\infty, 0)$. Verify the invertibility condition $E|e_t| \to 0$ as $t \to \infty$. Show that the corresponding invertibility region is given by
3.8 Subba Rao and Gabr (1984, pp. 211 – 212) consider the monthly West German unem-
ployment data (Xt ) for the time period January 1948 – May 1980 (389 observations).
They use the first 365 observations of the series Yt = (1 − B)(1 − B 12 )Xt for fitting
a subset BL model, and the last 24 observations for out-of-sample forecasting. It is
therefore vital that the fitted model is invertible. The best fitted subset BL model is
given by
Assuming the above model is correctly specified, check the empirical invertibility of the
fitted BL model using Algorithm 3.1 with N = 1,000. The complete (undifferenced)
data set (German unemplmnt.dat) is available at the website of this book.
Chapter 4
FREQUENCY-DOMAIN TESTS
have long been preferred in applications. However, these test statistics tend to have
low power and require the specification of a smoothing or window-width parameter.
Consequently, various improvements and modifications of the Hinich bispectral test
statistics have been proposed; see Section 4.4 for a brief overview. First, in Section 4.4.1, we apply goodness-of-fit techniques to the asymptotic properties of the estimated bispectrum, resulting in new test statistics with increased power. In the following subsection, we describe a method to eliminate the arbitrariness concerning the selection of the smoothing parameter. In Section 4.4.3, we discuss another improvement based on a bootstrap algorithm, which approximates the finite-sample null distribution of Hinich's test statistics.
As we saw in Section 1.1, the differences between linear and nonlinear DGPs can
also be defined in terms of mean squared forecast errors (MSFEs). In Section 4.5, we
discuss a frequency domain linearity test statistic based on an additivity property
of the bispectrum of the innovation process of a stationary linear Gaussian process.
The bispectrum is used to check if the best predictor of an observed time series is
linear, and the series is deemed to be linear if this null hypothesis is not rejected
against the alternative hypothesis that the best forecast is quadratic. Section 4.6
contains a summary of numerical studies related to the size and power of most of the
test statistics discussed in this chapter. Finally, in Section 4.7, we apply a number
of test statistics to the six time series introduced in Chapter 1.
4.1 Bispectrum
Apart from Section 4.5, throughout this chapter we assume that {Yt }Tt=1 is a time
series arising from a real-valued third-order strictly stationary stochastic process
{Yt , t ∈ Z} that – for ease of notation – is assumed to have mean zero. One basic
tool for quantifying the inherent strength of dependence is the ACVF given by
γY () = E(Yt Yt+ ) ( ∈ Z). For testing nonlinearity and non-Gaussianity, another
useful function is the third-order cumulant, defined as γY (1 , 2 ) = E(Yt Yt+1 Yt+2 ),
(1 , 2 ∈ Z). Both functions are time invariant and unaffected by permutations in
their arguments, which creates the symmetries
where ω denotes the frequency. A sufficient, but not necessary, condition for the
existence of the spectrum is that ∞
=−∞ |γY ()| < ∞.
If, in addition, $\sum_{\ell_1,\ell_2=-\infty}^{\infty}|\gamma_Y(\ell_1, \ell_2)| < \infty$, then the bispectral density function, or bispectrum, exists and is defined as the bivariate, or double, FT of the third-order cumulant function,
$$f_Y(\omega_1, \omega_2) = \sum_{\ell_1,\ell_2=-\infty}^{\infty}\gamma_Y(\ell_1, \ell_2)\exp\{-2\pi i(\omega_1\ell_1 + \omega_2\ell_2)\}, \quad (\omega_1, \omega_2) \in [0, 1]^2. \tag{4.4}$$
Note that in a similar fashion higher-order spectral functions can be defined whose corresponding multi-dimensional FTs are termed polyspectra. The spectrum is real-valued and nonnegative. In contrast, the bispectrum and higher-order spectra are complex-valued.
In view of (4.1) – (4.4), we have the symmetry relations (4.5) and (4.6). The third-order cumulant and the bispectrum are mathematically equivalent, as are the spectrum and the ACVF. Clearly $f_Y(\omega)$ is symmetric about 0.5. From (4.4), and due to the periodicity of the FT (4.3), the bispectrum in the entire plane can be determined from the values inside one of the twelve sectors shown in Figure 4.1. Therefore, it is sufficient to consider only frequencies in the first triangular region (cf. Exercise 4.1), which we define as the principal domain $\mathcal{D}$; see (4.7) – (4.8).
Figure 4.1: Values of fY (ω1 , ω2 ) defined over the entire plane, as completely specified by
the values over any one of the twelve labeled sectors.
$$B_Y(\omega_1, \omega_2) = \frac{f_Y(\omega_1, \omega_2)}{\{f_Y(\omega_1)f_Y(\omega_2)f_Y(\omega_1 + \omega_2)\}^{1/2}}, \quad (\omega_1, \omega_2) \in \mathcal{D}. \tag{4.9}$$
The third-order cumulant of the general linear causal process (1.2) is given by
$$\gamma_Y(\ell_1, \ell_2) = E\Big(\sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}\sum_{j'=0}^{\infty}\psi_{j'}\varepsilon_{t+\ell_1-j'}\sum_{j''=0}^{\infty}\psi_{j''}\varepsilon_{t+\ell_2-j''}\Big) = E(\varepsilon_t^3)\sum_{j=0}^{\infty}\psi_j\psi_{j+\ell_1}\psi_{j+\ell_2}.$$
Hence, the bispectrum is
$$f_Y(\omega_1, \omega_2) = E(\varepsilon_t^3)\sum_{\ell_1,\ell_2=-\infty}^{\infty}\sum_{\ell=0}^{\infty}\psi_\ell\psi_{\ell+\ell_1}\psi_{\ell+\ell_2}\exp\{-2\pi i(\omega_1\ell_1 + \omega_2\ell_2)\} = \mu_{3,\varepsilon}H(\omega_1)H(\omega_2)\overline{H}(\omega_1 + \omega_2),$$
where $H(\omega) = \sum_{j=0}^{\infty}\psi_j\exp(-2\pi i\omega j)$ and $\mu_{3,\varepsilon} = E(\varepsilon_t^3)$.
Combining (4.10) and (4.11), the square modulus of the normalized bispectrum, called frequency bicoherence, is simply
$$|B_Y(\omega_1, \omega_2)|^2 = \frac{\mu_{3,\varepsilon}^2}{\sigma_\varepsilon^6}, \tag{4.12}$$
a constant for all frequency pairs $(\omega_1, \omega_2)$.
Given actual data of size $T$, consistent estimates of the spectrum and bispectrum can be obtained through various techniques. Broadly these techniques can be classified into three categories: nonparametric or conventional methods, parametric or model-based methods (e.g. AR modeling), and criterion-based methods (e.g. Burg's (1967) maximum entropy algorithm). The first category includes two classes: the direct method, which is based on computing the third-order extension of the sample periodogram, known as the third-order periodogram, and the indirect method, which is the extension of the FT of the sample ACVF to the third-order cumulant. Both methods are easy to understand and easy to implement, but are limited by their resolving power when $T$ is small, i.e., the ability to separate two closely spaced harmonics. Nevertheless, conventional methods dominate the literature.
$$I_T(\omega) = \sum_{\ell=-(T-1)}^{T-1}\hat\gamma_Y(\ell)\exp\{-2\pi i\omega\ell\}, \quad \omega \in [0, \tfrac{1}{2}], \tag{4.15}$$
where $\hat\gamma_Y(\ell) = T^{-1}\sum_{t=1}^{T-\ell}Y_tY_{t+\ell}$. The periodogram, however, is not a consistent estimator of $f_Y(\omega)$. Similarly, the third-order periodogram is an inconsistent estimator of $f_Y(\omega_1, \omega_2)$. Consistent estimators of $f_Y(\omega)$ and $f_Y(\omega_1, \omega_2)$ are obtained by "smoothing" the periodogram and third-order periodogram, and the resulting estimators are defined as
$$\hat{f}_Y(\omega) = \sum_{\ell=-M}^{M}\lambda\Big(\frac{\ell}{M}\Big)\hat\gamma_Y(\ell)\exp(-2\pi i\omega\ell), \quad \omega \in [0, \tfrac{1}{2}], \tag{4.16}$$
$$\hat{f}_Y(\omega_1, \omega_2) = \sum_{\ell_1,\ell_2=-M}^{M}\lambda\Big(\frac{\ell_1}{M}, \frac{\ell_2}{M}\Big)\hat\gamma_Y(\ell_1, \ell_2)\exp\{-2\pi i(\omega_1\ell_1 + \omega_2\ell_2)\}, \quad (\omega_1, \omega_2) \in \mathcal{D}, \tag{4.17}$$
where $\hat\gamma_Y(\ell_1, \ell_2) = T^{-1}\sum_{t=1}^{T-\beta}Y_tY_{t+\ell_1}Y_{t+\ell_2}$, with $\beta = \max\{0, \ell_1, \ell_2\}$ ($\ell_1, \ell_2 = 0, 1, \ldots, T-1$) and $1 \le M \ll T$ (truncation point).
The function $\lambda(\cdot)$ is a lag window, satisfying $\lambda(0) = 1$ and the symmetry condition (4.1). Furthermore, $\lambda(\cdot, \cdot)$ is a two-dimensional lag window satisfying the same symmetries as the third-order moment, and is real-valued and finite. A standard window is Parzen's lag window, which is defined as
$$\lambda(u) = \begin{cases} 1 - 6u^2 + 6|u|^3, & |u| \le \tfrac{1}{2},\\ 2(1 - |u|)^3, & \tfrac{1}{2} \le |u| \le 1,\\ 0, & |u| > 1.\end{cases} \tag{4.18}$$
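A minimal R sketch of (4.16) with the Parzen window (4.18); the real-valued form below uses the symmetry of $\hat\gamma_Y(\cdot)$, and all names are our own.

parzen <- function(u) {
  au <- abs(u)
  ifelse(au <= 0.5, 1 - 6 * u^2 + 6 * au^3,
         ifelse(au <= 1, 2 * (1 - au)^3, 0))
}
fhat <- function(Y, omega, M) {
  T   <- length(Y); Y <- Y - mean(Y)
  ell <- 0:M
  gam <- sapply(ell, function(l) sum(Y[1:(T - l)] * Y[(1 + l):T]) / T)  # acvf
  # real form of (4.16): gamma(0) + 2 * sum lambda(l/M) gamma(l) cos(2 pi omega l)
  gam[1] + 2 * sum(parzen(ell[-1] / M) * gam[-1] * cos(2 * pi * omega * ell[-1]))
}
Y <- arima.sim(list(ar = 0.4), n = 500)
fhat(Y, omega = 0.1, M = floor(sqrt(500)))       # smoothed spectral estimate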
Figure 4.2: (a) A realization of the diagonal BL(0, 0, 1, 1) process $Y_t = 0.4Y_{t-1}\varepsilon_{t-1} + \varepsilon_t$ with $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$; (b) Three-dimensional plot of $\gamma_Y(u, v)$; (c) Contour plot of the frequency bicoherence estimates of the BL process in (a); (d) Contour plot of the bicoherence of a series generated by the AR(1) process $Y_t = 0.4Y_{t-1} + \varepsilon_t$ with $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. Superimposed is a plot of the principal domain (4.7); $T = 100$.
The process is stationary and ergodic if $|\lambda| < 1$. According to Kumar (1986), the third-order cumulant is given by
$$\gamma_Y(\ell_1, \ell_2) = \begin{cases} 2\lambda^3\sigma_\varepsilon^3\dfrac{4 + 5\lambda^2}{1 - \lambda^2}, & (\ell_1, \ell_2) = (0, 0),\\[1ex] \dfrac{2\beta\sigma_\varepsilon^4(1 + \lambda^2 + \lambda^4)}{1 - \lambda^2}, & (\ell_1, \ell_2) = (1, 1),\\[1ex] \dfrac{4\beta^3\sigma_\varepsilon^6(1 + 2\lambda^2\sigma_\varepsilon^2 + 3\lambda^4\sigma_\varepsilon^4)}{1 - \lambda^2}, & (\ell_1, \ell_2) = (1, 0),\\[1ex] \beta^3\sigma_\varepsilon^6, & (\ell_1, \ell_2) = (2, 1),\\[1ex] \dfrac{6\beta^{2\ell_2+1}\sigma_\varepsilon^{2\ell_2+4}(1 + \lambda^2 + 2\lambda^4)}{1 - \lambda^2}, & (\ell_1 = 0,\ \ell_2 = 2, 3, \ldots),\\[1ex] 0, & \text{otherwise}.\end{cases} \tag{4.20}$$
(iii) Use (4.17) at each of the $(\omega_{j_p}, \omega_{k_q})$ in the finer grid, to obtain $\hat{f}_Y(\omega_{j_p}, \omega_{k_q})$, as $N$ unbiased, approximately uncorrelated, estimates of $f_Y(\omega_j, \omega_k)$.
(v) The test statistic for Gaussianity is developed as a complex analogue of Hotelling's $T^2$ test statistic. Specifically, calculate the statistic $T_1^2 = N\hat\eta^*\hat{A}^{-1}\hat\eta$, where $\hat{A} = N\hat\Sigma_f$ and $*$ denotes complex conjugate. For practical application, it is recommended to use the test statistic
$$F_1 = \frac{2(N - P)}{2P}T_1^2. \tag{4.21}$$
Under $H_0^{(1)}$, and as $T \to \infty$,
$$F_1 \xrightarrow{D} F_{\nu_1,\nu_2}. \tag{4.22}$$

$^1$ Choosing $K$ as a multiple of $T$ results in ordinates that directly match the Fourier frequencies.
Figure 4.3: Principal domain for the bispectrum with frequency pairs $(\omega_{j_p}, \omega_{k_q})$ (blue dots) ($p = -2, -1, 0, 1, 2$; $q = -2, -1, 1, 2$) and designated frequency pairs (red stars) for $d = 8$, $T = 250$; (a) $K = 6$, and (b) $K = 7$.
$$B = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 & 0\\ 0 & 1 & -1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & \cdots & 1 & -1\end{pmatrix}.$$
Under the null hypothesis $H_0^{(2)}$, $\hat\beta$ is asymptotically jointly normally distributed with mean $\mathbf{0}$, and variance–covariance matrix $B\Sigma_ZB'$.
Given the above results, the remaining part of the procedure to compute the test statistic goes as follows.
$$\hat\beta = B\bar{Z}, \quad \text{and} \quad \hat{S} = BS_ZB',$$
where
$$\bar{Z} = N^{-1}\sum_{i=1}^N Z_i^*, \quad \text{and} \quad S_Z = N^{-1}\sum_{i=1}^N(Z_i^* - \bar{Z})(Z_i^* - \bar{Z})^{*\prime}.$$
$$F_2 = \frac{N - P + 1}{P - 1}T_2^2, \tag{4.23}$$
where $T_2^2 = N\hat\beta'\hat{S}^{-1}\hat\beta$. Under $H_0^{(2)}$, and as $T \to \infty$,
$$F_2 \xrightarrow{D} F_{\nu_1,\nu_2}. \tag{4.24}$$
4.2.3 Discussion
There are some drawbacks to the test statistics (4.21) and (4.23). Typically the user has to decide on the choice of the lag window, the truncation point $M$, and the placing of the grids, i.e., the parameters $d$, $K$, and $r$. Based on 500 generated BL(2, 1, 1, 1) time series, W.S. Chan and Tong (1986) note that the results of the Subba Rao–Gabr linearity test statistic are sensitive to the choice of the lag window. The choice of the truncation point $M$ is another delicate issue; see, e.g., Subba Rao and Gabr (1984, Section 3.1) for various suggestions. One recommendation is that $M < T^{1/2}$. A more formal approach is to minimize the mean squared error (MSE) of the bispectral estimate, which is a function of $f_Y(\omega_1)$, $f_Y(\omega_2)$ and $f_Y(\omega_1, \omega_2)$, with respect to $M$.
The parameters $d$, $K$, and $r$ should be chosen as follows. First, it is required that $N \times [2K/3] < T$, where $[\,\cdot\,]$ denotes the integer part; see step (iv) of Algorithm 4.1. Next, to ensure that the spectral and bispectral estimates at different points of the grid are effectively uncorrelated, it is necessary to choose $d$ such that $d/T$ is larger than the spectral window corresponding to the lag window $\lambda(s)$. Similarly, $r$ should be chosen such that $r/T$ is less than the lag window. Finally, to ensure that points in different fine grids do not overlap, it is essential that $d \le T/\{K(r + 1)\}$. In summary, great skill is necessary in applying both test statistics (4.21) and (4.23).
where
$$Y(\omega_j) = \sum_{t=1}^T Y_t\exp\{-2\pi i\omega_j(t - 1)\}.$$
Since $Y(\omega_{j+T}) = Y(\omega_j)$ and $Y(\omega_{T-j}) = Y^*(\omega_j)$, the principal domain of $F_Y(\omega_j, \omega_k)$ is the triangular set $\mathcal{D}$. The bispectrum is estimated by averaging the third-order periodogram over adjacent frequency pairs:
$$\hat{f}_Y(\omega_m, \omega_n) = \frac{1}{M^2}\sum_{j=(m-1)M}^{mM-1}\sum_{k=(n-1)M}^{nM-1}F_Y(\omega_j, \omega_k), \tag{4.27}$$
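A minimal R sketch of the direct estimator (4.27), assuming the usual third-order periodogram $F_Y(\omega_j, \omega_k) = T^{-1}Y(\omega_j)Y(\omega_k)\overline{Y}(\omega_j + \omega_k)$; names and the value of $c$ are our own choices.

bispec_hat <- function(Y, m, n, cc = 0.6) {
  T <- length(Y); M <- floor(T^cc)
  d <- fft(Y - mean(Y))                       # d[j+1] = Y(omega_j), omega_j = j/T
  s <- 0
  for (j in ((m - 1) * M):(m * M - 1))
    for (k in ((n - 1) * M):(n * M - 1))
      s <- s + d[j + 1] * d[k + 1] * Conj(d[((j + k) %% T) + 1]) / T
  s / M^2                                     # average over the M x M square
}
Y <- as.numeric(arima.sim(list(ar = 0.4), n = 512))
bispec_hat(Y, m = 2, n = 1)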
with $M = T^c$ ($\tfrac{1}{2} < c < 1$). The complex variance of this estimator, assuming the terms in the summations are restricted to $\mathcal{D}$, excluding the manifolds $\omega_m = 0$, $\omega_m = \omega_n$, is given by
$$\text{Var}\{\hat{f}_Y(\omega_m, \omega_n)\} = \frac{T}{M^4}Q_{m,n}f_Y(\delta_m)f_Y(\delta_n)f_Y(\delta_{m+n}) + O(M/T),$$
where $\delta_x = (2x - 1)M/(2T)$ and $Q_{m,n}$ is the number of $(j, k)$ in the squares that are in $\mathcal{D}$, but not on the boundaries $j = k$ or $(2j + k) = T$, plus twice the number on these boundaries. Note, $TM^{-4}Q_{m,n} \le TM^{-2} = T^{1-2c} \to 0$ if $T \to \infty$, since $Q_{m,n} \le M^2$.
Figure 4.4: (a) Lattice in the principal domain for the bispectrum with $K = 10$ and $r = 5$; (b) Lattice $L$ in the principal domain of the bispectrum for estimating Hinich's test statistics; $T = 144$ and $c = 1/2$.
It can be shown (Hinich, 1982) that the asymptotic distribution of each estimator is complex normal, and that the estimators are asymptotically independent inside the principal domain. Therefore, the distribution of the statistic
$$\hat{B}_Y(\omega_m, \omega_n) = \frac{\hat{f}_Y(\omega_m, \omega_n)}{\big\{(T/M^4)Q_{m,n}\hat{f}_Y(\delta_m)\hat{f}_Y(\delta_n)\hat{f}_Y(\delta_{m+n})\big\}^{1/2}} \tag{4.28}$$
is complex normal with unit variance, with $\hat{f}_Y(\cdot)$ the estimator of the spectral density function constructed by averaging $M$ adjacent periodogram ordinates. Now $2|\hat{B}_Y(\omega_m, \omega_n)|^2$ is approximately distributed as $\chi_2^2(\lambda_{m,n})$, i.e. a noncentral chi-square distribution with two degrees of freedom and noncentrality parameter
$$\lambda_{m,n} = 2(T^{1-4c}Q_{m,n})^{-1}|B_Y(\omega_m, \omega_n)|^2 \ge 2T^{2c-1}|B_Y(\omega_m, \omega_n)|^2. \tag{4.29}$$
Thus, the value of (4.29) increases when a smaller set of frequency pairs $(\omega_m, \omega_n)$ is considered.
The choice of the parameter $c$ controls the trade-off between the bias and the variance of $\hat{B}_Y(\cdot, \cdot)$. The smallest bias is obtained for $c = 1/2$, whereas the smallest variance is obtained for $c = 1$. The power of the test for a zero bispectrum depends on $T^{1/2}$; when $T^{1-c}$ is large, $c$ should be slightly larger than $1/2$ to give a consistent estimate.
$$\lambda_{m,n} = 2T^{2c-1}\frac{\mu_{3,\varepsilon}^2}{\sigma_\varepsilon^6} \equiv \lambda_0.$$
Thus, the noncentrality parameter becomes a constant. Since $E(|\hat{B}_Y(\omega_m, \omega_n)|^2) = 1 + \lambda_{m,n}/2$, it follows from (4.29) and the asymptotic properties of $\hat{B}_Y(\omega_m, \omega_n)$ that the parameter $\lambda_0$ can be consistently estimated by
$$\hat\lambda_0 = \frac{2}{PM^2}\sum_{(m,n)\in L}Q_{m,n}\big(|\hat{B}_Y(\omega_m, \omega_n)|^2 - 1\big), \tag{4.30}$$
and $f_{\chi_2^2(\lambda_0)}(\cdot)$ is the density function of a $\chi_2^2(\lambda_0)$ random variable. It is not difficult to estimate $q_{0.25}$, $q_{0.75}$, and (4.32) for a given value of $\lambda_0$. In practice, the estimator (4.30) is used in the computations of these values.
which is asymptotically distributed as a central $\chi_{2P}^2$ variate under $H_0^{(2)}$, with $P \approx T^2/(12M^2)$; see (4.30). Note that (4.33) is essentially the Subba Rao–Gabr test statistic $T_1^2$; i.e., instead of using an estimate of the bispectral density in the sum of squares, (4.33) uses an estimate of the normalized bispectrum.
4.3.3 Discussion
For relatively large sample sizes Ashley et al. (1986) examine in an MC simulation study the size and power of Hinich's linearity and Gaussianity test statistics. Overall, the sizes of these test statistics are satisfactory. What seems more important, however, is that the power of the linearity test statistic is disturbingly low in distinguishing between linear and nonlinear time series processes. In particular, this seems to be the case for ExpAR and SETAR behavior. Furthermore, Harvill and Newton
(1995) show that uncommonly large time series sample sizes are necessary before
the normal distribution in (4.32) is reliable for calculating p-values. Additionally,
these authors point out that the asymptotics of this problem are present in three
interwoven forms: the length T of the observed time series, the number of points
M used to estimate the normalized bispectrum, and the number P of normalized
bispectral estimates used in calculating the IQR. For instance, to have P = 100
requires a series of length T = 1,200 when using M = T 1/2 .
Although Hinich’s approach is robust to outliers in the case of linearity, a dis-
advantage of using the IQR is that if the null hypothesis is false and the process is
of a type of nonlinearity which would result in a peak in |BY (ωm , ωn )|2 , the range
effectively ignores that distinguishing feature. So the test statistic may differentiate
between linear and nonlinear processes but provides no clue as to the form of non-
linearity. To some extent this may be overcome by visually assessing plots of the
frequency bicoherence.
More importantly, Garth and Bresler (1996) raise some concerns with the assumptions required to form the linearity test statistic. As the number of discrete FT values of $\{Y_t\}_{t=1}^T$ increases as $T \to \infty$, the assumption that $|\hat{B}_Y(\omega_m, \omega_n)|^2$ will converge to the proposed noncentral $\chi_2^2(\lambda_0)$ distribution is violated, as this requires a finite number of bispectral estimates. Ignoring the finite-dimensionality constraint leads to a different asymptotic distribution; it can also lead to dependence between two estimates, smoothed over distinct frequency regions. The dependence is eliminated by summing the discrete FT over a finite subset of points, which is true for the indirect estimate of the bispectrum. This approach, however, introduces the additional problem of carefully choosing the spectral bandwidth $M$, as with the Subba Rao–Gabr test statistics.
the difference between the empirical distribution function (EDF) of $2|\hat{B}_Y(\omega_m, \omega_n)|^2$ and the noncentral $\chi_2^2(\lambda_{m,n})$ as the null distribution.
Unfortunately, finding the null distribution of the resulting EDF-based test statistic is intractable. Jahan and Harvill (2008) overcome this problem by approximating the noncentral $\chi_2^2(\cdot)$ distribution by a normal distribution in the following way. Let $X \sim \chi_\nu^2(\lambda)$. Then a remarkably accurate approximation (Sankaran, 1959) for the tails of the $\chi_\nu^2(\lambda)$ distribution consists of replacing $X$ by $Y = (X/(\nu + \lambda))^h$, where the exponent $h$ is given by
$$h = 1 - \frac{2(\nu + \lambda)(\nu + 3\lambda)}{3(\nu + 2\lambda)^2}. \tag{4.34}$$
Specifically, $Y$ has an approximate normal distribution with mean and variance given respectively by
$$\mu_Y = 1 + h(h - 1)\frac{\nu + 2\lambda}{(\nu + \lambda)^2} - h(h - 1)(2 - h)(1 - 3h)\frac{(\nu + 2\lambda)^2}{2(\nu + \lambda)^4}, \tag{4.35}$$
$$\sigma_Y^2 = h^2\frac{2(\nu + 2\lambda)}{(\nu + \lambda)^2}\Big[1 - (1 - h)(1 - 3h)\frac{\nu + 2\lambda}{(\nu + \lambda)^2}\Big]. \tag{4.36}$$
If $\lambda$ is unknown, it is recommended to replace $\lambda$ by the method of moments based estimator
$$\hat\lambda = \begin{cases}\bar{X} - \nu & \text{if } \bar{X} > \nu,\\ 0 & \text{otherwise},\end{cases} \tag{4.37}$$
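The approximation (4.34) – (4.36) is easy to code and to check against the exact noncentral chi-square CDF; a minimal R sketch (function name our own):

sankaran_cdf <- function(x, nu, lambda) {
  h <- 1 - 2 * (nu + lambda) * (nu + 3 * lambda) / (3 * (nu + 2 * lambda)^2)
  r <- (nu + 2 * lambda) / (nu + lambda)^2
  muY <- 1 + h * (h - 1) * r - h * (h - 1) * (2 - h) * (1 - 3 * h) * r^2 / 2
  sdY <- sqrt(h^2 * 2 * r * (1 - (1 - h) * (1 - 3 * h) * r))
  pnorm(((x / (nu + lambda))^h - muY) / sdY)   # P(X <= x) via Y = (X/(nu+lambda))^h
}
sankaran_cdf(6, nu = 2, lambda = 3)            # compare with the exact value:
pchisq(6, df = 2, ncp = 3)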
$$\text{AD} = -P - \frac{1}{P}\sum_{i=1}^P(2i - 1)\big[\log Q_{(i)} + \log(1 - Q_{((P+1)-i)})\big], \tag{4.39}$$
assuming $Q_{(i)} \neq 0$ or $1$.
For testing linearity both the mean and the variance of the transformed random variables are unknown. In that case these quantities are estimated by $\bar{B}_Y$, the sample mean of the $\hat{B}_Y(\omega_i^{(1)}, \omega_i^{(2)})$ ($i = 1, \ldots, P$), and the sample variance $(P - 1)^{-1}\sum_{i=1}^P\big(\hat{B}_Y(\omega_i^{(1)}, \omega_i^{(2)}) - \bar{B}_Y\big)^2$. Then, according to Stephens (1986, Table 4.9), the asymptotic upper-tail p-value can be computed by first transforming CvM to the modified (m) statistic $\text{CvM}_m = \text{CvM}(1 + 0.5/P)$ and next calculating a parabolic approximation, i.e.,
Below we summarize the two-stage procedure for testing for Gaussianity and
linearity.
(c) Compare the value of the test statistic with the appropriate critical
value.
where
$$\widehat{\text{IDR}}_M = \frac{\{f_{\chi_2^2(\lambda_{m,n})}(q_{0.9}) - f_{\chi_2^2(\lambda_{m,n})}(q_{0.1})\} - \{f_{\chi_2^2(\hat\lambda_0)}(q_{0.9}) - f_{\chi_2^2(\hat\lambda_0)}(q_{0.1})\}}{\hat\sigma_0} \tag{4.41}$$
is the standardized IDR fractile. The estimate $\hat\sigma_0^2$ of $\sigma_0^2$ follows from (4.32) with $f_{\chi_2^2(\lambda_0)}(\cdot)$ replaced by $f_{\chi_2^2(\hat\lambda_0)}(\cdot)$. The use of the IDR rather than the IQR in (4.41) is in line with Hinich et al. (2005) who, from numerous real and artificial applications, notice that the IDR gives more robust test results.
In an analogous way, maximal test statistics can be defined on the basis of the IQR and 80% fractiles of $\hat{B}_Y(\omega_m, \omega_n)$. Following the same arguments as in Hinich (1982), it can be shown that all these maxi-minimal test statistics are asymptotically distributed as $\mathcal{N}(0, 1)$ under the null hypothesis that $\{Y_t, t \in \mathbb{Z}\}$ is a linear DGP, as defined by (1.2).
λ(ω) = λ(−ω),
λ(ω1 , ω2 ) = λ(ω2 , ω1 ) = λ(−ω1 , ω2 − ω1 ). (4.42)
Clearly, both conditions mimic (4.1) and (4.2), or (4.5) and (4.6). But condition
(4.42) is not required for proving consistency or asymptotic normality of (4.17).
Let $\omega_j = (\omega_j^{(1)}, \omega_j^{(2)})$ ($j = 1, \ldots, P$) denote the $j$th frequency pair in the lattice $L$. Then, as already noted in Section 4.2, the kernel estimators $\hat{f}_Y(\omega_j^{(1)}, \omega_j^{(2)})$ as in (4.17) are approximately complex Gaussian with variance
$$\text{Var}\{\hat{f}_Y(\omega_j^{(1)}, \omega_j^{(2)})\} = \frac{M^2}{T}W_2f_Y(\omega_j^{(1)})f_Y(\omega_j^{(2)})f_Y(\omega_j^{(1)} + \omega_j^{(2)}), \tag{4.43}$$
where
$$W_2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\lambda^2(\omega^{(1)}, \omega^{(2)})\,d\omega^{(1)}\,d\omega^{(2)}. \tag{4.44}$$
$$Z_Y(\omega_j^{(1)}, \omega_j^{(2)}) = \frac{\hat{f}_Y(\omega_j^{(1)}, \omega_j^{(2)})}{\{M^2W_2/T\}^{1/2}\{f_Y(\omega_j^{(1)})f_Y(\omega_j^{(2)})f_Y(\omega_j^{(1)} + \omega_j^{(2)})\}^{1/2}}. \tag{4.45}$$
Hence, the statistics $2|\hat{Z}_Y(\omega_j^{(1)}, \omega_j^{(2)})|^2$ ($j = 1, \ldots, P$) are asymptotically distributed as independent noncentral $\chi_2^2$ variates, with noncentrality parameter $|f_Y(\omega_j^{(1)}, \omega_j^{(2)})|^2/\{(M^2W_2/T)f_Y(\omega_j^{(1)})f_Y(\omega_j^{(2)})f_Y(\omega_j^{(1)} + \omega_j^{(2)})\}$. For the purpose of testing linearity and Gaussianity, the set of random variables $2|\hat{Z}_Y(\omega_j^{(1)}, \omega_j^{(2)})|^2$ for all $(\omega_j^{(1)}, \omega_j^{(2)})$ is considered to be a random sample from a continuous distribution with CDF $F(\cdot)$.
Before detailing the steps involved in the AR(∞)-sieve bootstrap procedure, we collect the spectral and bispectral density estimators into one long vector, i.e.,
$$V_T = \big(\hat{f}_Y(\omega_1^{(1)}), \ldots, \hat{f}_Y(\omega_P^{(1)}), \hat{f}_Y(\omega_1^{(2)}), \ldots, \hat{f}_Y(\omega_P^{(2)}), \hat{f}_Y(\omega_1^{(1)} + \omega_1^{(2)}), \ldots\big)'.$$
Figure 4.5: Profiles of the Parzen lag window (black solid line) given by (4.18), and the
trapezoid-shaped lag window (blue medium dashed line) as given by (4.50).
Depending on the purpose of the analysis, one of the above three hypotheses is considered in the following bootstrap algorithm.

• When testing for $H_0^{(4)}$:
(a) Draw $T - p$ independent bootstrap residuals $\varepsilon_t^+$ from $F_T^{(3)}$.
(b) Transform the $\varepsilon_t^+$'s into pseudo-observations $\varepsilon_t^* = S_t\varepsilon_t^+$ with $\{S_t\} \overset{\text{i.i.d.}}{\sim} U\{-1, 1\}$, where $U$ denotes the discrete uniform distribution on $-1$ and $1$.
(c) Obtain the corresponding EDF $F_T^{(4)}$.

(iii) Compute the vector of pseudo-statistics $V_T^{(i)}(Y_t^{(b)})$ ($i = 3, 4, 5$) analogous to $V_T$, but with the series $\{Y_t^{(b)}\}$ generated from the fitted AR(p) model with error process $\{\varepsilon_t^{(b)}\} \overset{\text{i.i.d.}}{\sim} F_T^{(i)}$.
(iv) Repeat steps (ii) – (iii) $B$ times, to obtain $\{V_T^{(i)}(Y_t^{(b)})\}_{b=1}^B$ ($i = 3, 4, 5$). The EDF of these bootstrap statistics can then be used to approximate the distribution of $V_T$ under $H_0^{(i)}$ ($i = 3, 4, 5$). In Table 4.1 we label the corresponding test statistics, based on the IQR, as: $Z_{\text{IQR}}^{L+nG}$, $Z_{\text{IQR}}^{L+S}$, and $T_{\text{IQR}}^{G}$.
(v) Reject $H_0^{(i)}$ ($i = 3, 4, 5$) when the p-value is less than a pre-specified significance level.
Suppose, in addition to the assumptions imposed on $\gamma_Y(\cdot)$ and $\gamma_Y(\cdot, \cdot)$, that
$$\sum_{\ell=-\infty}^{\infty}\ell^2|\gamma_Y(\ell)| < \infty, \quad \text{and} \quad \sum_{\ell_1,\ell_2=-\infty}^{\infty}(1 + \ell_j^2)\big|\gamma_Y(\ell_1, \ell_2)\big| < \infty \quad (j = 1, 2). \tag{4.49}$$
Then Berg et al. (2010) prove the asymptotic consistency of the bootstrap test procedure under both the null hypothesis and the alternative hypothesis. They estimate the spectrum by a trapezoid-shaped lag window function (see Figure 4.5), and the bispectrum with a right-pyramidal frustum-shaped lag window function (see Figure 4.6(a)). These functions are, respectively, defined by (4.50) and (4.51), where
$$\lambda_0(x, y) = \begin{cases}\big(1 - \max(|x|, |y|)\big)^+, & -1 \le x, y \le 0 \text{ or } 0 \le x, y \le 1,\\ \big(1 - \max(|x + y|, |x - y|)\big)^+, & \text{otherwise},\end{cases}$$
with $(x)^+ = \max(0, x)$. Both infinite-order functions can produce higher-order accurate estimators of the spectral and bispectral densities.
4.4.4 Discussion
Similar to Hinich's original test statistics, the user of the AD- and CvM-type test statistics has to select $M$ (the bispectral bandwidth) and $P$ (the number of gridpoints). Consequently, the test statistics may still be sensitive to these user-specified parameters within the EDF framework. The automatic choice of $M$ in the maximal test (4.40) reduces the bias–variance trade-off associated with the Hinich linearity test statistic. However, the resulting $\text{MD}_L^{\text{IDR}}$ test statistic still relies on the
where the coefficients $c_j$ and $c_{jv}$ are chosen such that the minimum of MSFE(1) is achieved. If $\{Y_t, t \in \mathbb{Z}\}$ is non-Gaussian, then the one-step ahead quadratic forecast has a smaller asymptotic MSFE than the one-step ahead linear forecast (cf. Exercise 4.2(b)).
$$H_0: E\big[\{Y_{t+1} - Y_{t+1|t}^{Q}\} - \{Y_{t+1} - Y_{t+1|t}^{LS}\}\big]^2 = E\big[Y_{t+1|t}^{LS} - Y_{t+1|t}^{Q}\big]^2 = 0, \tag{4.55}$$
$$H_1: E\big[Y_{t+1|t}^{LS} - Y_{t+1|t}^{Q}\big]^2 > 0. \tag{4.56}$$
Assume that the fourth-order moments of $\{Y_t, t \in \mathbb{Z}\}$ exist, and let $f_Y(\omega)$ satisfy the so-called Szegö condition, i.e., $\int_0^1\log f_Y(\omega)\,d\omega > -\infty$, and assume all finite-dimensional distributions of $\{Y_t, t \in \mathbb{Z}\}$ have a positive spectrum. Then, in view of the symmetry relations (4.2), it can be shown (Terdik and Máth, 1993) that a necessary and sufficient condition for equivalence of $Y_{t+1|t}^{LS}$ and $Y_{t+1|t}^{Q}$ is that the bispectrum $f_e(\omega_1, \omega_2)$ of the innovation process has the additive form (4.57), where $H(\omega) = \sum_{j=0}^{\infty}\gamma_e(j, j)\exp(-2\pi i\omega j)$. The functions $f_e(\cdot, \cdot)$ which satisfy (4.57) are exactly those for which the relation (4.58) holds for any triplet $(\alpha, \beta, \gamma)$. This relationship forms the basis of the proposed linearity test statistic.
Test statistic
Consider the third-order periodogram of $\{e_t\}_{t=1}^{T}$ and the bispectral estimator
$$\hat f_e(\omega_1, \omega_2) = \frac{1}{(Tb_T)^2}\sum_{u,v=1}^{T-1} W_1(u, v)\,F_e(u/T, v/T), \tag{4.59}$$
which implies
$$\lim_{T\to\infty} Tb_T^2\,\mathrm{Var}\{\mathrm{Re}\,\hat f_e(\omega_1, \omega_2)\} = \frac{\sigma_e^6 W_2}{2} \quad\text{and}\quad \lim_{T\to\infty} Tb_T^2\,\mathrm{Var}\{\mathrm{Im}\,\hat f_e(\omega_1, \omega_2)\} = \frac{\sigma_e^6 W_2}{2},$$
Now, let $(\alpha, \beta, \gamma)$ denote a fixed triplet such that the map of $T_i(\cdot,\cdot)$ $(i = 1, \ldots, 4)$ of the six points
$$(\alpha, \beta),\ (\gamma, 0),\ (-\alpha + \gamma, -\beta - \gamma),\ (\beta, \gamma),\ (0, -\alpha - \beta),\ (-\alpha + \gamma, -\gamma)$$
$$\cdots + \hat f_e(-\alpha + \gamma, -\gamma).$$
$$M^{(K)}_{j,T}(\alpha, \beta, \gamma) = K^{-1/2}\sum_{i=1}^{K} R^{(i)}_{j,K}(\alpha, \beta, \gamma) \quad (j = 1, 2). \tag{4.63}$$
Under $H_0$, the expectation and variance of $M^{(K)}_{j,T}(\alpha, \beta, \gamma)$ $(j = 1, 2)$ are approximately equal to zero and unity, respectively. The resulting test statistic is given by
Under $H_0$, and as $T \to \infty$, $G^{(K)}_T$ has a $\chi^2_2$ distribution.
Computation
Clearly, (4.64) is computed for only one set of triplets in $D$. Generalizing to $n$ sets of triplets, each consisting of $K$ stretches, is straightforward. The various stages in the computation of the resulting test statistic can be summarized as follows.
The above window is optimal in the sense that it minimizes the MSE of the
bispectral estimate. For this window, evaluation of (4.44) gives W2 = 1.4628.
Figure 4.6(b) shows a plot of the profile of (4.65).
(iv) Using $n = 7$ triplets $(\alpha_i, \beta_i, \gamma_i)$, construct the two $3\times 2$ matrices with indices
$$\frac{N}{64}\begin{pmatrix} \alpha_i & \beta_i \\ \gamma_i & 0 \\ -\alpha_i + \gamma_i & -\beta_i - \gamma_i \end{pmatrix}, \qquad \frac{N}{64}\begin{pmatrix} \beta_i & \gamma_i \\ 0 & -\alpha_i - \beta_i \\ -\alpha_i + \gamma_i & -\gamma_i \end{pmatrix}, \qquad (i = 1, \ldots, n).$$
If an index is negative, then add $N$ to its value (an R sketch of this construction follows the algorithm). Let $(u, v)_i$ and $(u^*, v)_i$ $(u, u^* = 1, 2, 3;\ v = 1, 2)$ denote the resulting index for the $i$th triplet, corresponding to either the first or the second matrix. For instance, for $N = 2^6 = 64$, it is recommended to use the set of $n = 7$ triplets given by
$$\{(\alpha_i, \beta_i, \gamma_i)\}_{i=1}^{7} = \{(17, 27, 30), (17, 21, 10), (17, 24, 27), (18, 27, 14), (18, 21, 24), (19, 30, 1), (21, 27, 9)\}. \tag{4.66}$$
Figure 4.6: (a) Profile of the flat-top two-dimensional window function (4.51) used with
the bootstrap-based test statistics in Algorithm 4.4; (b) Profile of the two-dimensional lag
window (4.65) used in (4.59).
(v) For each triplet, compute
$$Q_i = \sum_{u=1}^{3}\hat f_\varepsilon\big(\omega_{(u,1)_i+1}, \omega_{(u,2)_i+1}\big) - \sum_{u^*=1}^{3}\hat f_\varepsilon\big(\omega_{(u^*,1)_i+1}, \omega_{(u^*,2)_i+1}\big), \qquad (i = 1, \ldots, n).$$
(vi) Form the vector $\widehat Q = (Q_1, \ldots, Q_n)'$, and compute the test statistic
$$G^{(K)}_{n,T} = K \times \frac{Nb_N^2}{3W_2}\,\|\widehat Q\|^2, \tag{4.67}$$
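As referenced in step (iv), the bookkeeping of the index matrices is easily mechanized; the following R sketch scales the two $3\times 2$ matrices by $N/64$ and adds $N$ to negative entries (the function name and the rounding convention are our own assumptions).

```r
triplet_indices <- function(a, b, g, N = 64) {
  M1 <- rbind(c(a, b), c(g, 0), c(-a + g, -b - g))
  M2 <- rbind(c(b, g), c(0, -a - b), c(-a + g, -g))
  fix <- function(M) {
    M <- round(M * N / 64)      # scale indices by N/64
    M[M < 0] <- M[M < 0] + N    # negative indices: add N
    M
  }
  list(fix(M1), fix(M2))
}
triplet_indices(17, 27, 30)     # first recommended triplet in (4.66)
```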
Note that for the construction of the test it is assumed that the coefficients $\psi_i$ in (4.52) and the coefficients $c_j$, $c_{jv}$ in (4.54) are known. In practice these coefficients need to be estimated. However, under not too restrictive conditions on $\{e_t\}$, it can be shown (Matsuda and Huzii, 1997) that the quadratic predictor $Y^{Q}_{t+1|t}$ has a smaller asymptotic MSE than the LS predictor $Y^{LS}_{t+1|t}$ if $p \ge p^*$, where $p$ and $p^*$ are limits imposed on the infinite summations on the right-hand side of (4.52) and (4.54), respectively. Thus, $H_0$ can still be tested using the statistic (4.67) if the unknown parameters are replaced by least squares estimates.
Discussion
One disadvantage of the above method of smoothing the bispectrum into $K$ equal nonoverlapping records of size $N$ is that information will be lost at the lower frequencies: the longest cycle that can now be observed has period $N$ instead of $T$. Also, since $K = T/N$ will not be an integer in general, some observations at the end of the series may be left out of the computation of the test statistic. Clearly, the alternative hypothesis $H_1$ presents limitations in that it only examines second-order features in departures from the null hypothesis. Terdik and Máth (1998) compare the power of the test statistic (4.31) with Hinich's linearity test statistic for a number of (non)linear models, but $G^{(K)}_{n,T}$ only shows an improvement for linear Hermite polynomial data. Applications of the Terdik–Máth test statistic are reported by, for instance, Terdik (1999), Terdik and Máth (1993), and Terdik et al. (2002).
• The empirical rejection levels (sizes) for linear DGPs with Gaussian distributed errors reported in many simulation studies are not always at the nominal rejection level, which in most studies is preset at 5%. Hence, it is somewhat unfair to compare the powers of test statistics that have different sizes.
• The bootstrap test statistics give generally better power results than Hinich's Gaussianity and linearity tests. The classical Hinich linearity test, $Z^{L}_{IQR}$, gives poor answers for very short series as it often has too few independent values to form an IQR.
• Of the three maximal linearity test statistics the maximal IDR test statistic, $Z^{L}_{IDR}$, has the largest power improvement over the Hinich linearity test, which reinforces the conjecture that by carefully tweaking the user-specified parameters some improvement of the Hinich linearity test can be obtained. However, the overall performance of the IDR test statistic is quite limited for data generated from a two-state Markov(2, 1) model, an EAR(2, 1) model, and a rational nonlinear AR model.
• The power of the $AD^{G}_m$ and $CvM^{G}_m$ test statistics is comparable with that achieved by the Hinich test statistic $T^{G}$, but often higher, especially in the case of data generated from a SETAR(2; 1, 1) model.
Table 4.1: Summary of size and power MC simulation studies for some frequency-domain
Gaussianity (G) and linearity (L) test statistics.
Table 4.2: Indicator pattern of p-values of the Gaussianity (G) and linearity (L) test statistics; ∗∗ marks a p-value < 0.01, ∗ marks a p-value in the range 1% – 5%, and † a p-value > 0.05.

                          Gaussianity (G)             Linearity (L)
                       GOF tests(1)  Btstrp(2)   GOF tests(1)       Btstrp(2)          MSFE(3)
Series                 AD_m^G CvM_m^G   T^G      AD_m^L CvM_m^L  Z_IQR^L Z_IDR^L Z_80%^L  G_{7,T}^(K)
Unemployment rate(4)    ∗∗     ∗         ∗        †      †        ∗∗      ∗∗      ∗        †
EEG recordings          ∗∗     ∗∗        ∗∗       †      †        ∗∗      ∗∗      ∗∗       ∗∗
Magnetic field data     ∗∗     ∗         †        ∗∗     †        †       †       †        ∗∗
ENSO phenomenon         ∗∗     †         †        †      †        †       †       †        ∗∗
Climate change: δ13C    †      ∗∗        †        †      †        †       †       †        ∗∗
                δ18O    ∗∗     ∗         †        †      †        ∗∗      ∗∗      ∗∗       †

(1) M = 18 for all series.
(2) Based on 1,000 bootstrap replicates, and M = T^0.6 for all series.
(3) Based on stretch lengths N = 2^7 (Unemployment, δ13C, and δ18O), N = 2^8 (ENSO), N = 2^9 (EEG), N = 2^10 (Magnetic field data); window-width N b_N = 8, p_max = 24.
(4) First differences of original series.
a time-dependent definition. Subba Rao and Gabr (1984) update their original frequency
domain tests to include frequencies along the manifold ωj = 0. Zoubir and Iskander (1999)
propose a bootstrap-based approach for testing departures from Gaussianity. Their simula-
tion results confirm that the Subba Rao–Gabr test statistic is a test of symmetry and not
pure Gaussianity. Nichols et al. (2009) provide an analytical expression for the bispectrum
and bicoherence functions for quadratically nonlinear DGPs subject to stationary, jointly
non-Gaussian distributed error processes possessing an arbitrary ACF.
Lii and Masry (1995) and Lii (1996) consider estimation of the bispectral density function of
continuous stationary DGPs when the data are obtained on unequally spaced time intervals.
Subba Rao (1997) gives an illustration of the usefulness of bispectra to analyze nonlinear,
unequally spaced, astronomical time series. Related to the analysis of continuous time
series, the problem of aliasing may arise when a real frequency in the series is not matched
by a Fourier frequency in the observed data. Testing for aliasing can be performed by an
amended version of the Hinich bispectrum test statistic for Gaussianity; see Hinich and
Wolinsky (1988).
Harvill et al. (2013) propose a bispectral-based procedure to distinguish among various non-
linear time series processes and between nonlinear and linear time series processes through
application of a hierarchical clustering algorithm.
Barnett and Wolff (2005) advocate the time-domain third-order moment $\gamma_Y(\ell_1, \ell_2)$ for testing nonlinearity over using the bispectrum. For a linear stationary time series the estimated values of the third-order moment are correlated. This complicates the construction of a parametric test. They overcome this problem by using the so-called phase scrambled bootstrap procedure (Theiler et al., 1992), a frequency-domain procedure. The method is computationally less intensive and more powerful than the Hinich test statistic. Three MATLAB files are available at http://www.mathworks.nl/matlabcentral/fileexchange/16062-test-of-non-linearity. These files are: third.m (calculates the 3rd-order moment for a time series), aaft.m (calculates the Amplitude Adjusted FT), and boot.m (calculates a bootstrap test for nonlinearity).
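For readers working in R rather than MATLAB, the core of the phase scrambled bootstrap is easily sketched: keep the modulus of the discrete Fourier transform but randomize the phases, which preserves the linear (second-order) properties of one realization. This is a hedged re-implementation of the general idea in Theiler et al. (1992), not a translation of the MATLAB files above.

```r
phase_scramble <- function(y) {
  n  <- length(y)
  z  <- fft(y)
  ph <- runif(n, 0, 2 * pi)
  half <- 2:ceiling(n / 2)
  ph[n + 2 - half] <- -ph[half]      # conjugate symmetry -> real surrogate
  ph[1] <- 0
  if (n %% 2 == 0) ph[n / 2 + 1] <- 0
  Re(fft(Mod(z) * exp(1i * ph), inverse = TRUE)) / n
}
```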
Section 4.2: Based on the evolutionary second-order spectrum and bispectrum (see, e.g.,
Priestley and Gabr (1993)), Tsolaki (2008) proposes test statistics for Gaussianity and lin-
earity of nonstationary slowly varying time series processes. These test statistics are gener-
alizations of the Subba Rao–Gabr tests for stationary processes.
Section 4.3: The use of a square shaped uniform smoothing window in the direct estim-
ator of the bispectrum in Hinich’s linearity and Gaussianity test statistics may introduce
severely biased estimates in relatively small areas of the bispectrum, and hence may lead to
a false acceptance of the null hypothesis with large probability. To ameliorate this problem,
Birkelund and Hanssen (2009) obtain an improved version of Hinich’s tests by proposing
a hexagonal shaped smoothing window. Yuan (2000a) investigates the effect of estimating
the noncentrality parameter λ0 on the asymptotic level of Hinich’s linearity test, and he
introduces a modification. The modified test also uses the IQR, but it tests the equality
of location parameters and its critical value does not depend on any unknown parameters.
In another paper, Yuan (2000b) extends Hinich's Gaussianity and linearity test statistics to stationary random fields on $\mathbb{Z}^m$ $(m = 1, 2, \ldots)$.
Section 4.7: Ashley and Patterson (1989), and Hinich and Patterson (1985) apply the
Subba Rao–Gabr test statistics and the Hinich test statistics to various real economic time
series. Brockett et al. (1988) and Patterson and Ashley (2000) present applications of these
tests with series taken from other areas, including examples from finance, engineering,
and geophysics. Teles and Wei (2000) investigate the performance of various linearity test
statistics, including Hinich’s linearity test, on time series aggregates. Temporal aggregation
greatly hampers the detection of nonlinear DGPs.
Drunat et al. (1998) compare the Hinich and the Subba Rao–Gabr linearity tests on a set of
exchange rates. A modified version of the original Hinich linearity test statistic forms a part
of a single-blind controlled competition among five linearity tests, and results are reported by
Barnett et al. (1997). Hinich et al. (2005) examine the performance of Hinich’s Gaussianity
and linearity tests and the Hinich–Rothman test statistic for time-reversibility (Chapter 8),
using bootstrap and surrogate data simulation methods. Using knowledge of the asymptotic
distribution of the bispectral density function under the null hypothesis of Gaussianity, Epps
(1987) proposes a large-sample GOF-type test statistic based on the difference between the
sample mean estimate and the ensemble averaged value of the characteristic function of the
time series, measured at some specific points. The AR-sieve bootstrap, discussed briefly in
Section 4.4.3, is reviewed in detail in Kreiss and Lahiri (2011).
Section 4.4: The empirical results of the AD- and CvM-type Gaussianity and linearity
test statistics (Table 4.2) can be reproduced with the goodnessfit.m MATLAB function
available at the website of this book. Also available is R code for computing the bootstrapped
form of Hinich’s Gaussianity and linearity test statistics of Section 4.4.3; see Exercise 4.4.
Furthermore, György Terdik made available TerM.m, a MATLAB module for calculating
the Terdik–Máth test statistic.
Exercises
Theory Questions
4.1 Prove that the triangular principal domain (4.7) of the bispectral density function $f_Y(\omega_1, \omega_2)$ is bounded by the manifolds $\omega_1 = \omega_2$, $\omega_1 = 0$, and $\omega_1 = (1 - \omega_2)/2$.
4.2 Consider the subdiagonal BL process $Y_t = \beta Y_{t-2}\varepsilon_{t-1} + \varepsilon_t$, where $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\varepsilon^2)$ with $\beta^2\sigma_\varepsilon^2 < 1$.
and
$$\mathrm{E}(Y_t^2 Y_{t-1}^2) = \frac{\sigma_\varepsilon^4(1 + 2\beta^2\sigma_\varepsilon^2)}{(1 - \beta^2\sigma_\varepsilon^2)^2}.$$
(b) The best one-step ahead quadratic predictor for $\{Y_t, t \in \mathbb{Z}\}$ is given by $Y^{Q}_{t+1|t} = c_{1,2}Y_tY_{t-1}$. Using the moment results in part (a), prove that the coefficient $c_{1,2}$ is given by
$$c_{1,2} = \beta\,\frac{1 - \beta^2\sigma_\varepsilon^2}{1 + 2\beta^2\sigma_\varepsilon^2}.$$
(c) Show that the maximum reduction of the one-step ahead MSFE of $Y^{Q}_{t+1|t}$, relative to $\mathrm{E}(Y_t^2) = \sigma_Y^2$, is reached at $\beta^2\sigma_\varepsilon^2 = (\sqrt{3} - 1)/2$.
4.3 By assuming that the bispectrum is non-zero over the entire region $D$, and that $f_Y(\omega_1, \omega_2)$ is partially differentiable once with respect to $\omega_1$, Sakaguchi (1991) shows that for any triplet $(\alpha, \beta, \gamma)$ the bispectrum $f_Y(\omega_1, \omega_2)$ satisfies the relation
$$f_Y(\alpha, \beta)f_Y(\gamma, 0)f_Y(-\alpha + \gamma, -\beta - \gamma) = f_Y(\beta, \gamma)f_Y(0, -\alpha - \beta)f_Y(-\alpha + \gamma, -\gamma). \tag{$*$}$$
where {εt } and {ηt } are independent and Gaussian i.i.d. processes with zero
mean and unit variance. Show that the bispectrum is given by
(b) Let α = β = 1/4 and γ = 0. Show that for the above nonlinear process the
left-hand side of (∗) is equal to 728 while the right-hand side is equal to 600,
indicating that the series is nonlinear.
4.4 Consider the first differences (USunemplmnt first dif.dat) of the quarterly U.S. unem-
ployment rate, earlier introduced in Example 1.1.
EXERCISES 153
(a) Using the R functions in the file Exercise44.r, write an MC simulation program to
compare Hinich’s Gaussianity test and Hinich’s linearity test with bootstrapped
forms of these tests. To evaluate the test statistics consider 1,000 BS replicates,
and take 20 MC simulations across all tests.
Compare the percentage of rejections of the test statistics at the 5% nominal sig-
nificance level. Are the results sensitive to the user-specified parameters (inputs)
in the simulations?
[Inputs: The number of gridpoints $K$, a discrete uniform random variable taking values in the set $\{3, 4, 5\}$; the spectral bandwidth $M_s = cM_b$, where $c \sim U[1.5, 3]$, and the bispectral bandwidth $M_b = 4$; the bootstrap AR order parameter $p$, a discrete uniform random variable taking values in the set $\{4, 5, \ldots, 15\}$. A skeleton of this input randomization is sketched below.]
(b) Compare part (a) with the corresponding test results reported in Table 4.2.
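As announced in the inputs above, a skeleton of the input randomization for part (a) may look as follows in R; the calls that actually evaluate the (bootstrapped) Hinich test statistics are supplied by Exercise44.r and are deliberately left as a comment, since its contents are not reproduced here.

```r
set.seed(123)
n_mc <- 20
for (m in seq_len(n_mc)) {
  K  <- sample(3:5, 1)          # number of gridpoints
  Mb <- 4                       # bispectral bandwidth
  Ms <- runif(1, 1.5, 3) * Mb   # spectral bandwidth, c ~ U[1.5, 3]
  p  <- sample(4:15, 1)         # bootstrap AR order
  ## ... evaluate Hinich's Gaussianity/linearity tests and their bootstrapped
  ## forms here (1,000 bootstrap replicates), and record 5%-level rejections
}
```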
Time-domain linearity test statistics are parametric; that is, they test the null hypothesis that a time series is generated by a linear process against a pre-chosen particular nonlinear alternative. Using the classical theory of statistical hypothesis testing, time-domain nonlinearity tests can be based on three principles – the likelihood ratio (LR), Lagrange multiplier (LM), and Wald (W) principles. LR-based test statistics require estimation of the model parameters under both the null and the alternative hypothesis, whereas test statistics based on the LM principle require estimation only under the null hypothesis. Application of W-based test statistics implies that the model parameters under the alternative hypothesis need to be estimated. Hence, in the case of complicated nonlinear alternatives, containing many more parameters than the model under the null hypothesis, test statistics constructed from the LM principle are often preferred over test statistics based on the other two testing principles.
In the first three sections that follow, we introduce these three principles briefly and show how they yield the most commonly known test statistics for nonlinearity. In Section 5.4, we discuss three test statistics based on a second-order Volterra expansion. These tests rely on an added variable approach, i.e., nonlinearity can be detected by examining the strength of the relationship of the residuals of a fitted linear model with nonlinear terms from a Volterra expansion via an F ratio of sums of squares of residuals. Evidently, this approach is linked to some of the LM test statistics proposed in Section 5.1. In Section 5.5, we first introduce the arranged autoregression principle. Based on this principle, we discuss two test statistics for SETARs. Then we discuss an F test statistic that combines the added variable approach with the arranged autoregression principle. Section 5.6 introduces a simple test procedure for discriminating among different nonlinear time series models.
Two appendices are added to the chapter. Appendix 5.A presents percentiles of
the LR-SETAR test statistic. Appendix 5.B provides a summary of size and power
studies. It includes some remarks about the strengths and weaknesses of the test
statistics.
where
$$\widehat\Sigma_{21} = \widehat\Sigma_{12}' = \sum_{t=1}^{T}\hat z_{2,t}\hat z_{1,t}', \quad\text{and}\quad \widehat\Sigma_{ii} = \sum_{t=1}^{T}\hat z_{i,t}\hat z_{i,t}' \quad (i = 1, 2),$$
$$LM_T \xrightarrow{D} \chi^2_{\nu_2}. \tag{5.6}$$
$$\hat\varepsilon_t = \hat z_{1,t}'\beta_1 + \hat z_{2,t}'\beta_2 + \eta_t, \tag{5.7}$$
$$LM_T = T\,\frac{SSE_0 - SSE}{SSE_0}. \tag{5.8}$$
We use the above formulation as a first step to derive various variants of LM test
statistics below. These variants depend on the form of the vector z2,t , which is
determined by the type of nonlinearity investigated.
Bilinear case
Consider the BL(p, q, P, Q) model (2.12). This model reduces to a linear ARMA(p, q) model if the last term on the right-hand side of (2.12) is zero, i.e., if $\psi_{uv} = 0\ \forall u, v$. Thus, the null hypothesis we wish to test is
$$H_0^{(1)}: \psi_{uv} = 0, \quad (u = 1, \ldots, P;\ v = 1, \ldots, Q). \tag{5.9}$$
$$\hat z_{1,t} = \Big(\frac{\partial\varepsilon_t(\theta)}{\partial\phi_0}, \frac{\partial\varepsilon_t(\theta)}{\partial\phi_1}, \ldots, \frac{\partial\varepsilon_t(\theta)}{\partial\phi_p}, \frac{\partial\varepsilon_t(\theta)}{\partial\theta_1}, \ldots, \frac{\partial\varepsilon_t(\theta)}{\partial\theta_q}\Big)' \tag{5.10}$$
and
$$\hat z_{2,t} = \Big(\frac{\partial\varepsilon_t(\theta)}{\partial\psi_{11}}, \ldots, \frac{\partial\varepsilon_t(\theta)}{\partial\psi_{PQ}}\Big)', \tag{5.11}$$
$$\frac{\partial\varepsilon_t(\theta)}{\partial\phi_0} = -\Big(1 + \sum_{\ell=1}^{q}\theta_\ell\,\frac{\partial\varepsilon_{t-\ell}(\theta)}{\partial\phi_0}\Big),$$
$$\frac{\partial\varepsilon_t(\theta)}{\partial\phi_i} = -\Big(Y_{t-i} + \sum_{\ell=1}^{q}\theta_\ell\,\frac{\partial\varepsilon_{t-\ell}(\theta)}{\partial\phi_i}\Big), \quad (i = 1, \ldots, p),$$
$$\frac{\partial\varepsilon_t(\theta)}{\partial\theta_j} = -\Big(\varepsilon_{t-j} + \sum_{\ell=1}^{q}\theta_\ell\,\frac{\partial\varepsilon_{t-\ell}(\theta)}{\partial\theta_j}\Big), \quad (j = 1, \ldots, q),$$
$$\frac{\partial\varepsilon_t(\theta)}{\partial\psi_{uv}} = -\Big(Y_{t-v}\varepsilon_{t-u} + \sum_{\ell=1}^{q}\theta_\ell\,\frac{\partial\varepsilon_{t-\ell}(\theta)}{\partial\psi_{uv}}\Big), \quad (u = 1, \ldots, P;\ v = 1, \ldots, Q),$$
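The last recursion is straightforward to evaluate once ARMA residuals are available. A hedged R sketch, evaluated under the null ($\psi_{uv} = 0$, so that the $\varepsilon_t$'s are the ARMA residuals), is as follows; all names are our own.

```r
## d eps_t/d psi_uv = -( Y_{t-v} eps_{t-u} + sum_l theta_l d eps_{t-l}/d psi_uv );
## 'Y' and 'eps' are the series and ARMA residuals, 'theta' the MA estimates.
dpsi <- function(Y, eps, theta, u, v) {
  Tn <- length(Y); q <- length(theta)
  d  <- numeric(Tn)
  for (t in seq(max(u, v, q) + 1, Tn))
    d[t] <- -(Y[t - v] * eps[t - u] + sum(theta * d[t - seq_len(q)]))
  d
}
```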
Now, the asymptotic distribution of the LM test statistic for BL(p, q, P, Q) models can be formulated as follows. Let $\{Y_t, t \in \mathbb{Z}\}$ be generated by (5.1) with $\mathrm{E}(\varepsilon_t^4) < \infty$. Assuming conditions (i) and (ii) are fulfilled, define the LM-type test statistic, denoted by $LM_T^{(1)}$, by substituting (5.10) – (5.11) for the corresponding quantities in (5.5).¹ Assume that the hypothesis of interest is $H_0^{(1)}$. Then, as $T \to \infty$,
$$LM_T^{(1)} \xrightarrow{D} \chi^2_{PQ - r(r+1)/2}, \tag{5.12}$$
$Q \le p + 1$. Under $H_0^{(1)}$, the corresponding LM-type test statistic is asymptotically distributed as $\chi^2_{PQ}$. The additional assumption $\mathrm{E}(\varepsilon_t^4) < \infty$ is not necessary if it is assumed that $\{\varepsilon_t\}$ is Gaussian WN.
Exponential AR case
Consider the ExpARMA model in (2.20) with $q = 0$. There are two possibilities to reduce the resulting ExpAR(p) model to a linear AR(p). One can either set the scaling factor $\gamma = 0$ or take $\xi_i = 0$ $(i = 1, \ldots, p)$. Since it appears that the first possibility is easier to work with, we introduce the null hypothesis
$$H_0^{(2)}: \gamma = 0. \tag{5.13}$$
Unfortunately, from (2.20) one can immediately see that the ExpAR(p) model is not identified when $H_0^{(2)}$ holds, i.e. the parameters $\xi_1, \ldots, \xi_p$ can take any values without changing the residual sum of squares. As a consequence the relevant inverses in (5.5) do not exist. To overcome this problem, the idea is to replace $\exp(\cdot)$ by a suitable linear approximation. The resulting test statistic is an LM-type test statistic which is identical to the LM test statistic for the hypothesis $\xi_1 = \cdots = \xi_p = 0$ in the auxiliary regression model (5.7). In this case the vectors $\hat z_{1,t}$ and $\hat z_{2,t}$ are defined, respectively, as
STAR model
Consider the STAR(2; p, p) model (2.42) with the transition function $G(Y_{t-d}; \gamma, c) = \Phi(\gamma\{Y_{t-d} - c\})$, i.e.
$$Y_t = \phi_0 + \sum_{i=1}^{p}\phi_i Y_{t-i} + \Big(\xi_0 + \sum_{i=1}^{p}\xi_i Y_{t-i}\Big)G(Y_{t-d}; \gamma, c) + \varepsilon_t. \tag{5.16}$$
Note that the parameters $\gamma$, $d$ $(1 \le d \le p)$, and $c$ are generally unknown. Hence, under $H_0^{(3)}$, the STAR(2; p, p) model is not identified. Analogous to the LM-type test statistic for the ExpAR(p) model one can solve this problem by replacing $G(\cdot)$ by a suitable linear approximation. In fact, it turns out that LM-type test statistics can be obtained for a wide class of smooth transition functions $G(\cdot)$ provided the following conditions are satisfied (Luukkonen et al., 1988a).
(a) The functions $G(\cdot)$ are odd, monotonically increasing, and possess a nonzero derivative of order $(2s + 1)$ in an open interval $(-a, a)$, for $a > 0$, $s \ge 0$.
(b) The functions $G(\cdot)$ are such that $G(0) = 0$ and $(d^kG(z)/dz^k)\big|_{z=0} \ne 0$ for $k$ odd and $1 \le k \le 2s + 1$.
$$T_1(z) \approx g_1 z. \tag{5.18}$$
Substituting (5.18) for $G(z_t) \equiv G(Y_{t-d}; \gamma, c)$ into (5.16) yields the auxiliary linear regression model
$$Y_t = a_0 + \sum_{i=1}^{p}a_i Y_{t-i} + c_0(Y_{t-d} - c) + \sum_{i=1}^{p}c_i u_{i,t} + \eta_t, \tag{5.19}$$
$$Y_t = \alpha_0 + \sum_{i=1}^{p}\alpha_i Y_{t-i} + \sum_{i=1}^{p}\sum_{j=i}^{p}\beta_{ij}Y_{t-i}Y_{t-j} + \eta_t. \tag{5.20}$$
$$H_0^{(3^*)}: \beta_{ij} = 0, \quad (i = 1, \ldots, p;\ j = i, \ldots, p). \tag{5.21}$$
The steps for computing the corresponding LM-type test statistic are as follows (a minimal R implementation is sketched after the algorithm).

Algorithm 5.1: $LM_T^{(3^*)}$ test statistic
(i) Regress $Y_t$ on $\{1, Y_{t-1}, \ldots, Y_{t-p}\}$ using LS; compute the residuals $\{\hat\varepsilon_t\}_{t=1}^{T}$, and the residual sum of squares $SSE_0 = \sum_t\hat\varepsilon_t^2$.
(ii) Regress $\hat\varepsilon_t$ on $\{1, Y_{t-i}, Y_{t-i}Y_{t-j};\ i = 1, \ldots, p;\ j = i, \ldots, p\}$; compute the residuals $\{\hat\eta_t\}_{t=1}^{T}$, and the residual sum of squares $SSE_1 = \sum_t\hat\eta_t^2$.
$$LM_T^{(3^*)} \xrightarrow{D} \chi^2_{\frac{1}{2}p(p+1)}, \quad\text{as } T \to \infty. \tag{5.23}$$
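As mentioned above, Algorithm 5.1 amounts to two LS regressions. A minimal R sketch (our own; the function and variable names are not from the text) is:

```r
star_lm <- function(y, p) {
  X  <- embed(y, p + 1)                     # columns: y_t, y_{t-1}, ..., y_{t-p}
  yt <- X[, 1]; lags <- X[, -1, drop = FALSE]
  e  <- residuals(lm(yt ~ lags))            # step (i)
  SSE0 <- sum(e^2)
  Q  <- do.call(cbind, lapply(1:p, function(i)      # Y_{t-i} Y_{t-j}, j >= i
          lags[, i] * lags[, i:p, drop = FALSE]))
  SSE1 <- sum(residuals(lm(e ~ lags + Q))^2)        # step (ii)
  LM <- length(yt) * (SSE0 - SSE1) / SSE0           # cf. (5.8)
  c(LM = LM, p.value = pchisq(LM, p * (p + 1) / 2, lower.tail = FALSE))
}
```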
Now, replacing $G(\cdot)$ in (5.16) by $T_3(\gamma\{Y_{t-d} - c\})$ gives the auxiliary model
$$Y_t = a_0 + \sum_{i=1}^{p}a_i Y_{t-i} + c_0(Y_{t-d} - c) + \sum_{i=1}^{p}c_i u_{i,t} + d_0(Y_{t-d} - c)^3 + \sum_{i=1}^{p}d_i w_{i,t} + \eta_t,$$
$$LM_T^{(3^{**})} \xrightarrow{D} \chi^2_{\frac{1}{2}p(p+1) + 2p^2}. \tag{5.27}$$
Note that the above three LM-type test statistics do not assume that the delay parameter $d$ is known. If, however, $d$ is known, then it can be shown that the numbers of degrees of freedom of $LM_T^{(3^*)}$, $LM_T^{(3^{**})}$, and $LM_T^{(4)}$ are $p$, $3p$, and $p + 1$, respectively. In that case the resulting test statistics will be different from the ones given above since the residual sums of squares $SSE_i$ $(i = 1, 2, 3)$ will be based on far fewer independent variables. Hence, prior knowledge about $d$ can be quite valuable in testing linearity against STAR(2; p, p) models.
where $\delta_j = \theta_j^{-} - \theta_j^{+}$. In addition, consider as a special case of the SETARMA model (2.29), the SETMA(2; q, q) model given by
$$Y_t = \mu + \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j} + \sum_{j=1}^{q}\delta_j I(Y_{t-d} \le r)\varepsilon_{t-j}. \tag{5.32}$$
A notable difference between (5.31) and (5.32) is that with (5.31) the regime switch-
ing is in {εt } whereas the threshold variable in the SETMA model is {Yt−d } (d ∈ Z+ )
itself. However, within the LM testing framework, this difference between both
models does not play a role in the development of a linearity test. Hence, below we
consider testing a linear MA model against an asMA(q) model. The procedure for
testing SETMA(2; q, q) types of nonlinearity is completely identical.
Define the parameter vectors $\theta = (\theta_1, \ldots, \theta_q)'$, $\delta = (\delta_1, \ldots, \delta_q)'$, and $\psi = (\mu, \theta', \delta', \sigma_\varepsilon^2)'$, where $\theta_j \equiv \theta_j^{+}$. Furthermore, assume that there are $q$ starting values $Y_{-q+1}, \ldots, Y_0$, and let $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\varepsilon^2)$, which is needed to specify the log-likelihood function. For the asymptotic distribution of the LM-type test statistic this latter assumption can be relaxed by requiring the existence of certain moments higher than order two of the process $\{\varepsilon_t, t \in \mathbb{Z}\}$. Given these specifications, it is apparent from (5.31) that the null hypothesis of linearity is given by
$$H_0^{(5)}: \delta = 0. \tag{5.33}$$
Assume that under $H_0^{(5)}$ the roots of $\theta(z) = 1 + \sum_k\theta_k z^k$ lie outside the unit circle to guarantee (global) invertibility. To derive an LM-type test statistic of $H_0^{(5)}$ we need the components of the gradient, or score, vector $\partial L_T(\psi)/\partial\psi$. They are
$$\frac{\partial L_T(\psi)}{\partial\theta_j} = -\frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^{T}\varepsilon_t\Big\{\varepsilon_{t-j} + \sum_k\big(\theta_k + \delta_k I(\varepsilon_{t-k} \le 0)\big)\frac{\partial\varepsilon_{t-k}}{\partial\theta_j}\Big\}, \quad (j = 1, \ldots, q), \tag{5.34}$$
$$\frac{\partial L_T(\psi)}{\partial\delta_j} = -\frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^{T}\varepsilon_t\Big\{I(\varepsilon_{t-j} \le 0)\varepsilon_{t-j} + \sum_k\big(\theta_k + \delta_k I(\varepsilon_{t-k} \le 0)\big)\frac{\partial\varepsilon_{t-k}}{\partial\delta_j}\Big\}, \tag{5.35}$$
$$\frac{\partial L_T(\psi)}{\partial\mu} = -\frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^{T}\varepsilon_t\Big\{1 + \sum_k\big(\theta_k + \delta_k I(\varepsilon_{t-k} \le 0)\big)\frac{\partial\varepsilon_{t-k}}{\partial\mu}\Big\}, \tag{5.36}$$
and
$$\frac{\partial L_T(\psi)}{\partial\sigma_\varepsilon^2} = -\frac{T}{2\sigma_\varepsilon^2} + \frac{1}{2\sigma_\varepsilon^4}\sum_{t=1}^{T}\varepsilon_t^2. \tag{5.37}$$
Under $H_0^{(5)}$, (5.34) has the form
$$\frac{\partial L_T(\psi)}{\partial\theta_j} = -\frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^{T}\varepsilon_t\Big\{\varepsilon_{t-j} + \sum_k\theta_k\frac{\partial\varepsilon_{t-k}}{\partial\theta_j}\Big\}, \quad (j = 1, \ldots, q). \tag{5.38}$$
From (5.38) it follows that $(1 + \sum_k\theta_k B^k)(\partial\varepsilon_t/\partial\theta_j) = -\varepsilon_{t-j}$ $(j = 1, \ldots, q)$, so that $\partial\varepsilon_t/\partial\theta_j = -\theta^{-1}(B)\varepsilon_{t-j}$, where $B$ is the backward shift operator. Moreover, $\partial\varepsilon_t/\partial\delta_j = -\theta^{-1}(B)I(\varepsilon_{t-j} \le 0)\varepsilon_{t-j}$ $(j = 1, \ldots, q)$ and $\partial\varepsilon_t/\partial\mu = -\theta^{-1}(1) = \text{constant}$, under $H_0^{(5)}$. The actual testing can be performed by the following steps.
Algorithm 5.3: $F_T^{(5)}$ test statistic
(i) Estimate the parameters of the asMA(q) model (5.31) with $\delta_j = 0$ $(j = 1, \ldots, q)$ consistently; compute the residuals $\{\hat\varepsilon_t\}_{t=1}^{T}$. The Hannan and Rissanen (1982) procedure, based on first estimating a long AR, is recommended for computing the MA parameters.
(ii) Regress $\hat\varepsilon_t$ on 1 and $\xi(B)\hat\varepsilon_{t-j}$ $(j = 1, \ldots, q)$, where $\xi(B) = \sum_{k=0}^{K}\xi_k B^k$ $(\xi_0 = 1)$ is the $K$th order approximation of $\hat\theta^{-1}(B)$; compute the residuals $\{\hat v_t\}_{t=1}^{T}$, and $SSE_0 = \sum_t\hat v_t^2$.
Under $H_0^{(5)}$, and as $T \to \infty$,
$$F_T^{(5)} \xrightarrow{D} F_{\nu_1,\nu_2} \tag{5.40}$$
with $\nu_1 = q$ and $\nu_2 = T - K - 2q - 1$.
Analogously,
$$\frac{\partial\varepsilon_t}{\partial\delta_j} = -\Big(I(\varepsilon_{t-j} \le 0)\varepsilon_{t-j} + \sum_k\theta_k\frac{\partial\varepsilon_{t-k}}{\partial\delta_j}\Big), \quad (j = 1, \ldots, q),$$
$$\frac{\partial\varepsilon_t}{\partial\mu} = -\Big(1 + \sum_k\theta_k\frac{\partial\varepsilon_{t-k}}{\partial\mu}\Big),$$
where the required initial values are set to zero. The second and third steps of the testing procedure can be modified as follows.
(ii*) Regress $\hat\varepsilon_t$ on $\partial\hat\varepsilon_t/\partial\hat\theta_j$ $(j = 1, \ldots, q)$ and $\partial\hat\varepsilon_t/\partial\hat\mu$ to obtain $\{\hat v_t\}$ and $SSE_0$.
ASTMA model
Consider the ASTMA model (2.45) which, for ease of exposition, we reproduce as
$$Y_t = \varepsilon_t + \sum_{j=1}^{q}\big\{\theta_j + \delta_j G_j(\gamma\varepsilon_{t-j})\big\}\varepsilon_{t-j}. \tag{5.41}$$
$$\frac{\partial L_T(\psi)}{\partial\gamma} = -\frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^{T}\varepsilon_t\sum_k\Big\{\theta_k\frac{\partial\varepsilon_{t-k}}{\partial\gamma} + \delta_k G_k'(0)\varepsilon_{t-k}^2 + \gamma G_k'(0)\delta_k\frac{\partial\varepsilon_{t-k}^2}{\partial\gamma}\Big\}.$$
Thus, under $H_0^{(6)}$,
$$\frac{\partial L_T(\psi)}{\partial\gamma} = -\frac{1}{\sigma_\varepsilon^2}\sum_{t=1}^{T}\varepsilon_t\sum_k\Big\{\theta_k\frac{\partial\varepsilon_{t-k}}{\partial\gamma} + \delta_k G_k'(0)\varepsilon_{t-k}^2\Big\}$$
and
$$\frac{\partial\varepsilon_t}{\partial\gamma} = -\sum_{k=1}^{q}\delta_k G_k'(0)\theta^{-1}(B)\varepsilon_{t-k}^2 \approx -\sum_{k=1}^{q}\delta_k G_k'(0)\xi(B)\varepsilon_{t-k}^2. \tag{5.43}$$
This does not yield a practicable test because the resulting test statistic, say $F_\delta$, depends on the unknown nuisance parameters $\delta_j$ $(j = 1, \ldots, q)$. We may, however, replace $SSE_\delta$ by $\inf_\delta SSE_\delta$ so that the test statistic becomes $\sup_\delta F_\delta$. The asymptotic null distribution of $\sup_\delta F_\delta$ is $\chi^2_q$. This is done by treating the $q$ elements in the last sum in (5.43) as separate variables and performing the following step.
(iii′) Regress $\hat v_t$ on 1, $\xi(B)\hat\varepsilon_{t-j}$ and $\xi(B)\hat\varepsilon_{t-j}^2$ $(j = 1, \ldots, q)$; compute the residual sum of squares $SSE^*$. Replace $SSE$ by $SSE^*$ in step (iv) of Algorithm 5.3.
$$G(\widetilde X_{t-1}; \gamma_j, \widetilde\omega_j, c_j) = \frac{1}{1 + \exp(-\gamma_j[\widetilde\omega_j'\widetilde X_{t-1} - c_j])} - \frac{1}{2}, \quad (j = 1, \ldots, k), \tag{5.45}$$
where $\widetilde X_{t-1} = (Y_{t-1}, \ldots, Y_{t-q})'$, and $\widetilde\omega_j = (\widetilde\omega_{1j}, \ldots, \widetilde\omega_{qj})'$.
$$\begin{aligned} Y_t = \alpha_0 &+ \sum_{i=1}^{p}\alpha_i Y_{t-i} + \sum_{i=1}^{q}\sum_{j=i}^{q}\beta_{ij}Y_{t-i}Y_{t-j} + \sum_{i=1}^{p-q}\sum_{j=1}^{q}\psi_{ij}Y^*_{t-i}Y_{t-j} \\ &+ \sum_{i=1}^{q}\sum_{j=i}^{q}\sum_{u=j}^{q}\beta_{iju}Y_{t-i}Y_{t-j}Y_{t-u} + \sum_{i=1}^{p-q}\sum_{j=1}^{q}\sum_{u=j}^{q}\psi_{iju}Y^*_{t-i}Y_{t-j}Y_{t-u} \\ &+ \sum_{i=1}^{q}\sum_{j=i}^{q}\sum_{u=j}^{q}\sum_{v=u}^{q}\beta_{ijuv}Y_{t-i}Y_{t-j}Y_{t-u}Y_{t-v} \\ &+ \sum_{i=1}^{p-q}\sum_{j=1}^{q}\sum_{u=j}^{q}\sum_{v=u}^{q}\psi_{ijuv}Y^*_{t-i}Y_{t-j}Y_{t-u}Y_{t-v} + \eta_t, \end{aligned} \tag{5.46}$$
where the vector $Y^*_t \in \mathbb{R}^{p-q}$ is formed by the elements of $X_{t-1} = (Y_{t-1}, \ldots, Y_{t-p})'$ that are not contained in $\widetilde X_{t-1}$. The corresponding null hypothesis of linearity is defined by
$$H_0^{(7)}: \beta_{ij} = 0,\ \psi_{ij} = 0,\ \beta_{iju} = 0,\ \psi_{iju} = 0,\ \beta_{ijuv} = 0,\ \psi_{ijuv} = 0. \tag{5.47}$$
$$Y_t = \alpha_0 + \sum_{i=1}^{p}\alpha_i Y_{t-i} + \sum_{i=1}^{p}\sum_{j=i}^{p}\beta_{ij}Y_{t-i}Y_{t-j} + \sum_{i=1}^{p}\sum_{j=i}^{p}\sum_{u=j}^{p}\beta_{iju}Y_{t-i}Y_{t-j}Y_{t-u} + \sum_{i=1}^{p}\sum_{j=i}^{p}\sum_{u=j}^{p}\sum_{v=u}^{p}\beta_{ijuv}Y_{t-i}Y_{t-j}Y_{t-u}Y_{t-v} + \eta_t, \tag{5.48}$$
with similar modifications in the specification of the null hypothesis $H_0^{(7)}$, and in the degrees of freedom of the resulting test statistics. Given (5.46) and (5.47), a third-order LM-type test statistic can be computed by the following steps.
Algorithm 5.4: $LM_T^{(7)}$ test statistic
(i) Regress $Y_t$ on $\{1, Y_{t-1}, \ldots, Y_{t-p}\}$ using LS; compute the residuals $\{\hat\varepsilon_t\}_{t=1}^{T}$, and the residual sum of squares $SSE_0 = \sum_t\hat\varepsilon_t^2$.
(ii) Regress $\hat\varepsilon_t$ on $\{1, Y_{t-1}, \ldots, Y_{t-p}\}$ and on each of the nonlinear regressors of (5.46); compute the residuals $\{\hat\eta_t\}_{t=1}^{T}$, and $SSE_2 = \sum_t\hat\eta_t^2$.
$$LM_T^{(7)} \xrightarrow{D} \chi^2_\nu, \tag{5.50}$$
where
$$\nu = \frac{q}{2!}(q+1) + \frac{q}{3!}(q+1)(q+2) + \frac{q}{4!}(q+1)(q+2)(q+3) + (p-q)\Big\{q + \frac{q}{2!}(q+1) + \frac{q}{3!}(q+1)(q+2)\Big\}.$$
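Since $\nu$ is easy to mistype, a small R helper transcribing the expression above may be useful (a direct transcription, nothing more):

```r
nu_550 <- function(p, q) {
  q / factorial(2) * (q + 1) +
  q / factorial(3) * (q + 1) * (q + 2) +
  q / factorial(4) * (q + 1) * (q + 2) * (q + 3) +
  (p - q) * (q + q / factorial(2) * (q + 1) +
                 q / factorial(3) * (q + 1) * (q + 2))
}
nu_550(p = 6, q = 3)
```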
The asymptotic properties of the above two test statistics do not crucially depend on the assumption that the activation-level function $G(\cdot)$ is logistic, provided conditions (a) and (b) given with the STAR model are satisfied. In practice, the test statistic (5.51) is preferred over (5.49) since the asymptotic $\chi^2_\nu$ distribution is likely to be a poor approximation to the finite-sample distribution of the LM-type test statistic if the number of degrees of freedom $\nu$ is large.
$$Y_t = \phi_0^{(1)} + \sum_{i=1}^{p}\phi_i^{(1)}Y_{t-i} + \Big(\phi_0^{(2)} + \sum_{i=1}^{p}\phi_i^{(2)}Y_{t-i}\Big)I(Y_{t-d} \le r) + \varepsilon_t. \tag{5.52}$$
Suppose, for the moment, that $p$ and $d$ are known $(1 \le d \le p)$. Further, we assume that the unknown threshold parameter $r$ takes a value inside a known bounded closed subset of $\mathbb{R}$, say $\mathcal{R} = [\underline r, \overline r]$, with $\underline r$ and $\overline r$ finite constants.
Let $\phi_i = (\phi_0^{(i)}, \ldots, \phi_p^{(i)})'$ $(i = 1, 2)$, and $\theta = (\phi_1', \phi_2')'$. We denote the parameter space by $\Theta = \Theta_{\phi_1}\times\Theta_{\phi_2}$, where $\Theta_{\phi_1}$ and $\Theta_{\phi_2}$ are compact subsets of $\mathbb{R}^{p+1}$. Suppose the true parameter vector $\theta_0 = (\phi_{10}', \phi_{20}')'$ is an interior point of $\Theta$. The hypotheses of interest are
$$H_0^{(8)}: \phi_{20} = 0, \qquad H_1^{(8)}: \phi_{20} \ne 0 \ \text{for some } r \in \mathcal{R}. \tag{5.53}$$
under $H_0^{(8)}$ and $H_1^{(8)}$ are, respectively,
$$L_{0T}(\phi_1) = \sum_{t=1}^{T}\varepsilon_t^2(\phi_1), \quad\text{and}\quad L_{1T}(\phi_2, r) = \sum_{t=1}^{T}\varepsilon_t^2(\phi_2, r), \tag{5.54}$$
where $\varepsilon_t(\phi_1) = \varepsilon_t(\theta, -\infty)$, and $\varepsilon_t(\phi_2, r)$ is defined based on the iterative equation (5.52). For a given $r$, let
$$\hat\phi_{1T} = \arg\min_{\phi_1\in\Theta_{\phi_1}}L_{0T}(\phi_1) \quad\text{and}\quad \hat\phi_{2T}(r) = \arg\min_{\phi_2\in\Theta_{\phi_2}}L_{1T}(\phi_2, r).$$
The quasi-LR statistic for testing $H_0^{(8)}$ against $H_1^{(8)}$ is then defined as
$$LR_T(r) = L_{0T}(\hat\phi_{1T}) - L_{1T}\big(\hat\phi_{2T}(r), r\big).$$
Since $r$ is unknown, a natural choice for a test statistic is $\sup_{r\in\mathcal{R}}LR_T(r)$. This choice, however, is undesirable since the test diverges to infinity in probability as $T \to \infty$. An appropriate alternative test statistic is
$$LR_T^{(8)} = \sup_{r\in\mathcal{R}}\Big\{L_{0T}(\hat\phi_{1T}) - L_{1T}\big(\hat\phi_{2T}(r), r\big)\Big\}\Big/L_{0T}(\hat\phi_{1T}). \tag{5.55}$$
and
$$\Omega_1(r) = \big\{\Sigma_{21}(r) - \Sigma_{21}(r)\Sigma_{22}^{-1}(r)\Sigma_{12}(r)\big\}^{-1},$$
where $\Sigma(\cdot)$, $\Sigma_{21}(\cdot) = \Sigma_{12}'(\cdot)$, and $\Sigma_{22}(\cdot)$ are $(p+1)\times(p+1)$ matrices. Let $\{\mathcal{G}_{2(p+1)}(r)\}$ denote a $2(p+1)$-dimensional vector Gaussian process with zero mean and covariance kernel $\Sigma(r\wedge s) - \Sigma_{21}(r)\Sigma^{-1}\Sigma_{12}(s)$; almost all its paths are continuous. Then, under $H_0^{(8)}$ and standard regularity conditions, it can be shown (Chan, 1991) that
$$LR_T^{(8)} \xrightarrow{D} \frac{1}{\sigma_\varepsilon^2}\sup_{r\in\mathcal{R}}\big\{\mathcal{G}_{2(p+1)}'(r)\,\Omega_1(r)\,\mathcal{G}_{2(p+1)}(r)\big\}, \quad\text{as } T \to \infty. \tag{5.57}$$
Using the Poisson clumping heuristic (Aldous, 1989), it follows that the limiting null distribution for the test statistic (5.57) is given by
$$P\Big(\sup_{r\in\mathcal{R}}\big\{\mathcal{G}_{2(p+1)}'(r)\,\Omega_1(r)\,\mathcal{G}_{2(p+1)}(r)\big\} \le \alpha\Big) \sim \exp\Big\{-2\chi^2_{p+1}(\alpha)\Big(\frac{\alpha}{p+1} - 1\Big)\sum_{i=1}^{p+1}\int_{\mathcal{R}}\Big|\frac{dt_i}{dr}\Big|\,dr\Big\}, \tag{5.58}$$
asymptotic distribution of $LR_T^{(8)}$. In fact, its asymptotics also holds when $\{\varepsilon_t\} \sim \mathrm{WN}(0, \sigma_\varepsilon^2)$; see, e.g., Chan (1990). Indeed, if this is the case we can treat (5.52) as a regression model with the $(p+1)$-vector of added variables $X_tI(Y_{t-d} \le r)$, with $X_t = (1, Y_{t-1}, \ldots, Y_{t-p})'$, and replace (5.55) by
$$F_T^{(8)} = T\,\frac{\sup_{r\in\mathcal{R}}\big\{SSE_0 - SSE_1\big(\hat\phi_{2T}(r), r\big)\big\}}{\inf_{r\in\mathcal{R}}SSE_1\big(\hat\phi_{2T}(r), r\big)}, \tag{5.60}$$
where $SSE_0$ and $SSE_1(\cdot)$ are the sums of squared residuals under $H_0^{(8)}$ and $H_1^{(8)}$, respectively.
Nested SETARs
It is straightforward to generalize the F test statistic (5.60) to a SETAR(k; p, ..., p) model $(k \ge 2)$. Let $X_t = (1, Y_{t-1}, \ldots, Y_{t-p})'$ be a $(p+1)\times 1$ vector. Using the notation introduced in Section 2.6, a convenient way of writing the k-regime SETAR model is
$(i = 1, \ldots, k)$.
When $k = 1$, (5.61) reduces to a linear AR(p), or a SETAR(1; p), model with zero thresholds, being the most restrictive within the class of k-regime SETAR models. The models within this class are strictly nested. This simply means that the i-regime SETAR model being tested, the null hypothesis, is a special case of the alternative SETAR(j; p, ..., p) model $(i < j;\ i = 1, \ldots, k)$ against which it is being tested. Here, we implicitly assume that there are no additional different constraints on the parameters $\phi_i$, and the delay $d$ is the same for both models.
Suppose the parameters of (5.61) are collected in the vector $\theta = (\phi_1', \ldots, \phi_k', r', d)'$ belonging to the parameter space $\Theta$. The LS estimator, say $\hat\theta$, of $\theta$ solves the minimization problem
$$\hat\theta = \arg\min_{\theta\in\Theta}\sum_{t=1}^{T}\Big\{Y_t - \sum_{j=1}^{k}\phi_j'X_tI_t^{(j)}(r, d)\Big\}^2. \tag{5.62}$$
Let $SSE_i$ be the residual sum of squares corresponding to an i-regime SETAR model. Then the natural analogue of (5.60) for testing an i-regime SETAR against a j-regime SETAR model is defined by
$$F_T^{(i,j)} = T\,\frac{SSE_i - SSE_j}{SSE_j}, \quad (i < j;\ i = 1, \ldots, k). \tag{5.63}$$
This is equivalent to the conventional LM-type test statistic (5.8); a grid-search sketch in R is given below.
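As announced above, the statistic can be sketched in R by a grid search over candidate thresholds; this is our own illustrative implementation of (5.63) for $i = 1$, $j = 2$, not code from the text.

```r
setar_F12 <- function(y, p, d, lo = 0.1, hi = 0.9) {
  X  <- embed(y, p + 1)
  yt <- X[, 1]; lags <- X[, -1, drop = FALSE]
  z  <- lags[, d]                                    # threshold variable Y_{t-d}
  SSE1 <- sum(residuals(lm(yt ~ lags))^2)            # linear AR(p) fit
  grid <- quantile(z, seq(lo, hi, by = 0.01))        # candidate thresholds
  SSE2 <- sapply(grid, function(r) {
    Ind <- as.numeric(z <= r); IL <- Ind * lags
    sum(residuals(lm(yt ~ lags + Ind + IL))^2)       # SETAR(2; p, p) fit
  })
  max(length(yt) * (SSE1 - SSE2) / SSE2)             # F_T^(1,2) over the grid
}
```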
We can solve the minimization problem (5.62) sequentially through concentration. For instance, for the case $k = 2$, minimization over $\phi = (\phi_1', \phi_2')'$ is an LS problem,
where
$$M_T(r, d) = X_1'(r, d)X_1(r, d) - X_1'(r, d)X\,(X'X)^{-1}X'X_1(r, d),$$
with $X_1(r, d) \equiv X_tI_t(r, d)$ and $X$ the $T\times(p+1)$ matrix whose $t$th row is $X_t'$.
(iii) Generate $\{\varepsilon_t^*\}_{t=1}^{T}$ random draws (with replacement) from the LS residuals of the fitted SETAR(1; p) model.
(iv) With fixed initial values $\{Y_0, Y_{-1}, \ldots, Y_{-p+1}\}$, recursively generate $\{Y_t^*\}_{t=1}^{T}$ using the SETAR(1; p) model with $\hat\theta_1$. Select a new set $\mathcal{R}^*$ falling between the $\underline r\times 100$ lower and $\overline r\times 100$ upper percentiles of the EDF of $\{Y_t^*\}_{t=1}^{T}$.
² Hansen (1999) shows how to calculate the asymptotic distribution of $F_T^{(1,i)}$ for the case of a stationary process with possibly heteroskedastic error terms. Several minor modifications in the formula for the asymptotic approximation (5.65) are needed. Also, for this case, he proposes an adjusted version of the bootstrap procedure.
Figure 5.1: ENSO phenomenon. Asymptotic and bootstrap distribution of the $F_T^{(1,2)}$ test statistic.
The above procedures, i.e. via the asymptotic null distribution and bootstrap-
ping, can be extended to the case of testing a two-regime SETAR model against
a three-regime SETAR model. Some caution is needed, however. The problem is
that under the null hypothesis, the parameter r1 has a non-standard asymptotic
distribution (Chan, 1993).
We illustrate the use of the test statistic (5.63) with an application to the monthly ENSO series (T = 748) introduced in Example 1.4. After some initial exploration, we set $p = 5$. The estimated AR(5) model is given by
where the sample variances of $\{\hat\varepsilon_t^{(i)}\}$ $(i = 1, 2)$ are $4.72\times 10^{-2}$ $(T_1 = 455)$
where the sample variances of $\{\hat\varepsilon_t^{(i)}\}$ $(i = 1, 2, 3)$ are $5.69\times 10^{-2}$ $(T_1 = 140)$, $4.17\times 10^{-2}$ $(T_2 = 334)$, and $4.71\times 10^{-2}$ $(T_3 = 269)$, respectively. The $F_T^{(1,3)}$ test statistic equals 38.21. Both the asymptotic and bootstrapped p-values are 0.09. So, there is insufficient evidence to reject the AR(5) model in favor of the three-regime SETAR model. The $F_T^{(2,3)}$ test statistic equals 9.85, with a large bootstrapped p-value. Thus, in summary, it appears that an appropriate model for the ENSO data is the SETAR(2; 5, 5) model.
Figure 5.1 shows the asymptotic and bootstrap distributions of $F_T^{(1,2)}$. For fixed $(r, d)$, the test statistic $F_T^{(i,j)}$ has an asymptotic $\chi^2_{p+1}$ distribution. Its density function is plotted for reference. Clearly, the $\chi^2_6$ distribution is highly misleading relative to the other two distributions. The bootstrap procedure properly approximates the asymptotic distribution in this case.
SETARMA model
Recall the SETARMA(2; p, p, q, q) model with delay d:
$$Y_t = \phi_0^{(1)} + \sum_{i=1}^{p}\phi_i^{(1)}Y_{t-i} + \sum_{j=1}^{q}\phi_j^{(2)}\varepsilon_{t-j} + \Big(\psi_0^{(1)} + \sum_{i=1}^{p}\psi_i^{(1)}Y_{t-i} + \sum_{j=1}^{q}\psi_j^{(2)}\varepsilon_{t-j}\Big)I(Y_{t-d} \le r) + \varepsilon_t, \tag{5.69}$$
where, following Li and Li (2011), we assume that $\varepsilon_t = \eta_t\sigma_t$, where $\{\eta_t\} \overset{\text{i.i.d.}}{\sim} (0, \sigma_\varepsilon^2)$.
where $\hat\sigma_\varepsilon^2 = L_{0T}(\hat\phi_T)/T$ with $\hat\phi_T = \arg\min_{\phi\in\Theta_\phi}L_{0T}(\phi)$, and $\hat\theta_T(r) = \arg\min_{\theta\in\Theta}L_{1T}(\theta, r)$. Denote $\Omega(r)$ as in (5.56) with
$$\Omega_1(r) = \Omega^{-1}(r) - \mathrm{diag}(\Sigma^{-1}, 0),$$
where $\Sigma(\cdot)$, $\Sigma_{21}(\cdot) = \Sigma_{12}'(\cdot)$, $\Sigma_{22}(\cdot)$, and $0$ are $(p+q+1)\times(p+q+1)$ matrices, and where $\varepsilon_t(\theta_0, r)$ is defined based on the iterative equation (5.69).
Let $\{\mathcal{G}_{2(p+q+1)}(r), r \in \mathcal{R}\}$ denote a $2(p+q+1)$-dimensional vector Gaussian process with zero mean and covariance kernel $\mathrm{E}\big\{\varepsilon_t^2\,\frac{\partial\varepsilon_t(\theta_0, r)}{\partial\theta}\frac{\partial\varepsilon_t(\theta_0, s)}{\partial\theta'}\big\}$, almost all of whose paths are continuous. Assume that all roots of the polynomials $1 - \sum_{i=1}^{p}\phi_i^{(1)}z^i$ and $1 + \sum_{j=1}^{q}\phi_j^{(2)}z^j$ are outside the unit circle, and these polynomials are coprime. In addition, assume that the polynomials $1 - \sum_{i=1}^{p}\psi_i^{(1)}z^i$ and $1 + \sum_{j=1}^{q}\psi_j^{(2)}z^j$ are also coprime. The coprime nature of the polynomials is necessary to uniquely identify the parameters of the SETARMA model, i.e., the assumption makes the matrix $\Omega(r)$ positive definite. Then, under $H_0^{(9)}$, some standard regularity conditions, complemented with conditions on the moments of the random variable $\varepsilon_t$, it can be shown (Li and Li, 2011) that, as $T \to \infty$,
$$LR_T^{(9)} \xrightarrow{D} \frac{1}{\sigma_\varepsilon^2}\sup_{r\in\mathcal{R}}\big\{\mathcal{G}_{2(p+q+1)}'(r)\,\Omega_1(r)\,\mathcal{G}_{2(p+q+1)}(r)\big\}. \tag{5.72}$$
Because distribution theory is not available for the $LR_T^{(9)}$ test statistic for general SETARMA models, classical bootstrap methods can in principle be used to obtain p-values. However, computing time will be huge if, for each bootstrap replicate, (5.71) needs to be recomputed.
where $\xi_T(r) = \frac{1}{\sqrt T}\sum_{t=1}^{T}\varepsilon_t\frac{\partial\varepsilon_t(\theta_0, r)}{\partial\theta}$. Clearly, the quantity $\xi_T'(r)\Omega_1(r)\xi_T(r)$ is a quadratic form. Provided any possible dependence on the threshold structure in an observed time series is removed first, we can obtain a bootstrap approximation of $LR_T^{(9)}$ by randomly permuting the summands in $\xi_T(r)$. In particular, the bootstrapping takes place as follows.
Algorithm 5.6: Bootstrapping p-values of the $LR_T^{(9)}$ statistic
(i) Generate $\{\varepsilon_t\}_{t=1}^{T+n} \overset{\text{i.i.d.}}{\sim} N(0, 1)$ random draws, with $n$ the number of initial observations. Generate $\{Y_t\}_{t=1}^{T+n}$ from a SETARMA(2; p, p, q, q) model, with or without possible dependence structure in the errors, using $\{\varepsilon_t\}$.
(iii) Fit an ARMA(p, q) model to $\{Y_t\}_{t=1}^{T}$. Denote the resulting estimate of $\phi$ by $\hat\phi_T$. Compute $L_{0T}(\hat\phi_T) = \sum_{t=1}^{T}\hat\varepsilon_t^2(\hat\phi_T)$.
(vi) Generate a sequence $\{\varepsilon_t^*\}$ of i.i.d. random variables with mean zero, variance unity, and finite fourth moment. Suggested distribution functions are $N(0, 1)$ and the Rademacher distribution, which takes values $\pm 1$ with probability 0.5.
(vii) Let $\tilde\varepsilon_t = \hat\varepsilon_t(\hat\theta_T(\tilde r), \tilde r)$. Remove any possible threshold structure in a time series by generating $\widetilde Y_t = \hat\theta'\widetilde Z_t + \tilde\varepsilon_t$, where $\widetilde Z_t = (1, \widetilde Y_{t-1}, \ldots, \widetilde Y_{t-p}, \tilde\varepsilon_{t-1}, \ldots, \tilde\varepsilon_{t-q})'$ with $\tilde\varepsilon_t = 0$ for $t \le 0$.
(viii) Select a new set $\mathcal{R}^*$ falling between the $\underline r\times 100$ lower and $\overline r\times 100$ upper percentiles of the distribution of $\{\widetilde Y_t\}$. Let $\tilde r$ be the new threshold parameter.
Algorithm 5.6: Bootstrapping p-values of the $LR_T^{(9)}$ statistic (Cont'd)
$$\frac{\partial\tilde\varepsilon_t(\tilde r)}{\partial\phi} = -\widetilde Z_t - \sum_{j=1}^{q}\hat\phi_j^{(2)}\frac{\partial\tilde\varepsilon_{t-j}}{\partial\phi},$$
$$\frac{\partial\tilde\varepsilon_t(\tilde r)}{\partial\psi} = -\widetilde Z_tI(\widetilde Y_{t-d} \le \tilde r) - \sum_{j=1}^{q}\hat\phi_j^{(2)}\frac{\partial\tilde\varepsilon_{t-j}(\tilde r)}{\partial\psi}, \qquad \frac{\partial\tilde\varepsilon_t(\tilde r)}{\partial\theta} = \Big(\frac{\partial\tilde\varepsilon_t(\tilde r)}{\partial\phi'}, \frac{\partial\tilde\varepsilon_t(\tilde r)}{\partial\psi'}\Big)',$$
where the necessary initial values in the recursions are set to zero. Moreover, as an estimator of $\Omega(\tilde r)$, compute the outer product of the vector functions, i.e. $\widehat\Omega(\tilde r) = \frac{1}{T}\sum_{t=1}^{T}\big(\frac{\partial\tilde\varepsilon_t(\tilde r)}{\partial\theta}\frac{\partial\tilde\varepsilon_t(\tilde r)}{\partial\theta'}\big)$.
(x) Compute the vector function $\xi_T(\varepsilon^*, \tilde r) = \frac{1}{\sqrt T}\sum_{t=1}^{T}\varepsilon_t^*\tilde\varepsilon_t\frac{\partial\tilde\varepsilon_t(\tilde r)}{\partial\theta}$, and the statistic
$$LR_T^{(b)}(\varepsilon^*, \tilde r) = \frac{\xi_T'(\varepsilon^*, \tilde r)\big\{\widehat\Omega^{-1}(\tilde r) - \mathrm{diag}(\widehat\Sigma^{-1}, 0)\big\}\xi_T(\varepsilon^*, \tilde r)}{\tilde\sigma_\varepsilon^2\,\hat\sigma_{\varepsilon^*}^2},$$
where $\tilde\sigma_\varepsilon^2 = T^{-1}L_{1T}(\hat\theta_T(\tilde r), \tilde r)$ and $\hat\sigma_{\varepsilon^*}^2 = T^{-1}\sum_{t=1}^{T}\{\varepsilon_t^*\}^2$.
(xii) Repeat steps (ix) – (xi) for different values of $\tilde r$. Compute $LR_T^{(b)}(\varepsilon^*) = \sup_{\tilde r}LR_T^{(b)}(\varepsilon^*, \tilde r)$. The bootstrap p-value is then given by
$$\frac{1}{B}\sum_{b=1}^{B}I\big(LR_T^{(9)}(\tilde r) < LR_T^{(b)}(\varepsilon^*)\big).$$
can be improved (see Chapter 6), it can well serve as a benchmark for testing the ARMA(1, 1) model against a SETARMA(2; 1, 1, 1, 1) model with delay $d \in [1, \ldots, 6]$. Setting $B = 10{,}000$, $\underline r = 0.1$, $\overline r = 0.9$, and generating $\{\varepsilon_t^*\}$ (step (vi)) from an $N(0, 1)$ distribution, we fitted various two-regime SETARMA models to the data. For $d = 2$ the p-value (0.049) of the $LR_T^{(9)}$ test statistic is smaller than the 5% nominal significance level. The associated model is given by
The third classical test, the Wald (W) test, is based exclusively on the unrestricted estimates $\hat\theta$ of $\theta$. Assume that the ARasMA model is invertible, and let $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\varepsilon^2)$. Then, for the unrestricted model, the log-likelihood function at time $t$ is
$$\ell_t(\theta) = -\frac{1}{2\sigma_\varepsilon^2}\varepsilon_t^2(\theta) - \frac{1}{2}\log\sigma_\varepsilon^2, \tag{5.78}$$
Let $\varepsilon_t \equiv \varepsilon_t(\theta)$. Then the score vector at time $t$ is given by $G_t(\theta) = \partial\ell_t(\theta)/\partial\theta = -\sigma_\varepsilon^{-2}\varepsilon_t\,\partial\varepsilon_t/\partial\theta$, where
$$\frac{\partial\varepsilon_t}{\partial\theta} = -\big(1 + v_{t,1}^{\phi_0}\ \vdots\ Y_{t-1} + v_{t,1}^{\phi}\ \cdots\ Y_{t-p} + v_{t,p}^{\phi}\ \vdots\ \varepsilon_{t-1}^{+} + v_{t,1}^{+}\ \cdots\ \varepsilon_{t-q}^{+} + v_{t,q}^{+}\ \vdots\ \varepsilon_{t-1}^{-} + v_{t,1}^{-}\ \cdots\ \varepsilon_{t-q}^{-} + v_{t,q}^{-}\big)',$$
with
$$v_{t,j}^{\theta} = \sum_{k=1}^{q}\big\{\theta_k^{+}I(\varepsilon_{t-k} > 0) + \theta_k^{-}I(\varepsilon_{t-k} \le 0)\big\}\,\partial\varepsilon_{t-k}/\partial\theta_j.$$
Here, the superscript on $v_t$ together with the second subscript indicate the appropriate element within the $\theta$ vector. The empirical Hessian $\widehat H_T$ associated with the log-likelihood function can be approximated by the summed outer product of $G_t$, i.e. $\widehat H_T = \sum_{t=1}^{T}G_tG_t'$. Let $\hat\theta$ be the vector of parameter estimates of $\theta$, and $\widehat H_T^{-1}(\hat\theta)$ the estimate of the corresponding covariance matrix. Then the W test statistic can be expressed as
$$W_T^{(10)} = (R_2\hat\theta)'\big\{R_2\widehat H_T^{-1}(\hat\theta)R_2'\big\}^{-1}R_2\hat\theta. \tag{5.79}$$
Under $H_0^{(10)}$, and as $T \to \infty$, (5.79) has an asymptotic $\chi^2_q$ distribution.
Thus $H_1$ is quite general. Therefore the resulting test statistics are often termed portmanteau-type tests.
Obviously, if $\{Y_t, t \in \mathbb{Z}\}$ is linear, i.e., if $\psi_{uv} = 0\ \forall u, v$, then $\varepsilon_t$ will be independent of $\varepsilon_{t-u}\varepsilon_{t-v}$. If, however, $\{Y_t, t \in \mathbb{Z}\}$ is nonlinear, i.e., if any of the second-order coefficients $\psi_{uv}$ are non-zero, this is not so. Then this nonlinearity will be reflected in the relationship of the residuals of a fitted linear model with, for instance, $Y_{t-1}Y_{t-2}$, a quadratic nonlinear term. This is called the added variable approach. Below, we discuss three variants.
one degree of freedom test for nonadditivity in analysis of variance. The mechanics of computing the test statistic are as follows.
(ii) Regress $\{\widehat Y_t^2\}$ on $\{1, Y_{t-1}, \ldots, Y_{t-p}\}$; compute the residuals $\{\hat\xi_t\}_{t=p+1}^{T}$.
$$F_T^{(T)} = \frac{\hat\eta^2}{(SSE - \hat\eta^2)/(T - 2p - 2)}, \tag{5.81}$$
where $\hat\eta = \hat\eta_0\big(\sum_t\hat\xi_t^2\big)^{1/2}$ with $\hat\eta_0$ the regression coefficient in step (ii). Under $H_0$, and as $T \to \infty$, $F_T^{(T)} \xrightarrow{D} F_{\nu_1,\nu_2}$ with $\nu_1 = 1$ and $\nu_2 = (T - p) - (p + 1) - 1$. The estimated size of (5.81) can be improved by using $T - p$ instead of $T - 2p - 2$ in the denominator of $F_T^{(T)}$ (Luukkonen et al., 1988b). This improvement also applies to the next two F test statistics.
Keenan (1985) shows that $F_T^{(T)}$ is approximately distributed as $\chi^2_1$, but the F-version may be preferred in practice because it is computationally convenient and reasonably powerful in finite samples. An advantage of (5.81) is that it is easy and quick to implement, involving little subjective choice of parameters. On the other hand, the $F_T^{(T)}$ test statistic is only valid for the Volterra expansion, and not all nonlinear processes possess this expansion.
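Because (5.81) requires only three LS regressions, a compact R sketch is easy to give; this is our own hedged implementation of the Tukey nonadditivity-type test, following the steps in the text.

```r
keenan_F <- function(y, p) {
  X   <- embed(y, p + 1)
  yt  <- X[, 1]; lags <- X[, -1, drop = FALSE]
  fit <- lm(yt ~ lags)
  e   <- residuals(fit)
  xi  <- residuals(lm(fitted(fit)^2 ~ lags))    # step (ii)
  eta0 <- unname(coef(lm(e ~ xi - 1)))          # regress residuals on xi
  eta  <- eta0 * sqrt(sum(xi^2))
  Tn   <- length(yt)
  Fst  <- eta^2 / ((sum(e^2) - eta^2) / (Tn - 2 * p - 2))
  c(F = Fst, p.value = pf(Fst, 1, Tn - 2 * p - 2, lower.tail = FALSE))
}
```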
Original F test
This F test statistic is a direct modification of the original (O) Tukey nonadditivity-type test statistic (5.81), and hence its name; see Tsay (1986).³ The test considers the residuals of regressions that include the individual quadratic and cross-product terms $\{Y_{t-1}^2, Y_{t-1}Y_{t-2}, \ldots, Y_{t-1}Y_{t-p}, Y_{t-2}^2, Y_{t-2}Y_{t-3}, \ldots, Y_{t-p}^2\}$, while $F_T^{(T)}$ considers the residuals of regressions on only the squared terms.
Let $X_t = (Y_{t-1}, \ldots, Y_{t-p})'$, and define the $P = \frac{1}{2}p(p+1)$-dimensional vector $Z_t = \mathrm{vech}(X_tX_t')$. Further, assume that $\{\varepsilon_t\} \sim \mathrm{WN}(0, \sigma_\varepsilon^2)$ with $\mathrm{E}(\varepsilon_t^4) < \infty$. The procedure for performing the original F test statistic is outlined in the following steps.
³ The name given to this test statistic is taken from Tsay (1991). This reference also serves as the source for the names given to the original, the augmented, and the new F test statistics (Section 5.5) discussed below.
Algorithm 5.8: $F_T^{(O)}$ test statistic
(i) Choose an appropriate even value of $p$, e.g. $p = 4$ or $p = 8$. Regress $Y_t$ on $\{1, Y_{t-1}, \ldots, Y_{t-p}\}$; compute the residuals $\{\hat\varepsilon_t\}_{t=p+1}^{T}$.
(ii) Regress the first $p + 1$ elements of $Z_t$ on $\{1, Y_{t-1}, \ldots, Y_{t-p}\}$ and obtain the residuals $\{\hat\xi_{1,t}\}_{t=p+1}^{T}$.
(iii) Then regress the next $p + 1$ elements of $Z_t$ on $\{1, Y_{t-1}, \ldots, Y_{t-p}\}$ and obtain the residuals $\{\hat\xi_{2,t}\}_{t=p+1}^{T}$.
(iv) Continue with steps (ii) – (iii) until the residuals from all $p/2$ regressions have been obtained. From these residuals, form the $(p/2)\times 1$ vector $\{\hat\xi_t\}_{t=p+1}^{T}$.
(v) Regress $\hat\varepsilon_t$ on $\hat\xi_t$; compute the residual sum of squares $\sum_t\hat\omega_t^2$.
(vi) From the regression in (v) calculate the test statistic $F_T^{(O)}$ as the F ratio of the mean square of regression to the mean square error, i.e.
$$F_T^{(O)} = \frac{\big(\sum_t\hat\xi_t\hat\varepsilon_t\big)'\big(\sum_t\hat\xi_t\hat\xi_t'\big)^{-1}\big(\sum_t\hat\xi_t\hat\varepsilon_t\big)/P}{\sum_t\hat\omega_t^2/(T - p - P - 1)}. \tag{5.82}$$
Under $H_0$, and as $T \to \infty$, $F_T^{(O)} \xrightarrow{D} F_{\nu_1,\nu_2}$ with degrees of freedom $\nu_1 = p(p+1)/2$ and $\nu_2 = T - \frac{1}{2}p(p+3) - 1$; Tsay (1986). (An R sketch of this test is given below.)
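As referenced at the end of the algorithm, an R sketch follows. For brevity it residualizes the full $P = \frac{1}{2}p(p+1)$ vector $Z_t$ in one multivariate regression rather than in $p/2$ blocks; under our reading this yields the same added-variable F test, the blocking of steps (ii) – (iv) being a computational device.

```r
tsay_F <- function(y, p) {
  X    <- embed(y, p + 1)
  yt   <- X[, 1]; lags <- X[, -1, drop = FALSE]
  e    <- residuals(lm(yt ~ lags))
  Z    <- do.call(cbind, lapply(1:p, function(i)     # vech(X_t X_t')
            lags[, i] * lags[, i:p, drop = FALSE]))
  xi   <- residuals(lm(Z ~ lags))                    # residualized regressors
  fit2 <- lm(e ~ xi)
  P    <- ncol(Z); Tn <- length(yt)
  Fst  <- ((sum(e^2) - sum(residuals(fit2)^2)) / P) /
          (sum(residuals(fit2)^2) / (Tn - p - P - 1))
  c(F = Fst, p.value = pf(Fst, P, Tn - p - P - 1, lower.tail = FALSE))
}
```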
Note that the test statistic $P\cdot F_T^{(O)}$ is asymptotically distributed as $\chi^2_P$. Using the LM testing procedure of Section 5.1, it can be easily shown (Luukkonen et al., 1988a) that both tests (5.81) and (5.82) are LM-type test statistics. Simulation results show that the $F_T^{(O)}$ is more powerful than the $F_T^{(T)}$ test statistic in identifying BL-type nonlinearity.
Augmented F test
The augmented (A) F test (Luukkonen et al., 1988a) extends the $F_T^{(O)}$ test statistic by including the regression of the cubic terms $\{Y_t^3\}$ on $(1, Y_{t-1}, \ldots, Y_{t-p})$ in the set of regressions in steps (ii) – (iv) of Algorithm 5.7. The $\big((p/2) + 1\big)$th set of residuals $\{\hat\xi_{(p/2)+1,t}\}_{t=p+1}^{T}$ are included in $\hat\xi_t$. Call the resulting vector $\hat\xi_t^{(A)}$. Perform a linear regression of $\hat\varepsilon_t$ on $\hat\xi_t^{(A)}$, and obtain the residual sum of squares $\sum_t\{\hat\omega_t^{(A)}\}^2$. Then the associated F test statistic is given by
$$F_T^{(A)} = \frac{\big(\sum_t\hat\xi_t^{(A)}\hat\varepsilon_t\big)'\big(\sum_t\hat\xi_t^{(A)}(\hat\xi_t^{(A)})'\big)^{-1}\big(\sum_t\hat\xi_t^{(A)}\hat\varepsilon_t\big)/P}{\sum_t\{\hat\omega_t^{(A)}\}^2/(T - p - P - 1)}. \tag{5.83}$$
Under $H_0$ of linearity, and as $T \to \infty$, $F_T^{(A)} \xrightarrow{D} F_{\nu_1,\nu_2}$, where $\nu_1 = \frac{1}{2}p(p+1) + p$ and $\nu_2 = T - p(p+3)/2 - 2p$. Clearly, if $p = 1$, the asymptotic distribution of
Given the set of observations $\{Y_t\}_{t=1}^{T}$, the threshold variable $Y_{t-d}$ can assume the values $\{Y_i\}_{i=h}^{T-d}$, where $h = \max\{1, p + 1 - d\}$. Let $\tau_j$ be the time index of the $j$th smallest observation among $\{Y_i\}_{i=h}^{T-d}$. Assume that the recursive autoregressions begin with a minimum number of start-up values, say $n_{\min} > p + 1$. Denote the resulting ordered time series by $\{Y_{\tau_j}\}_{j=n_{\min}+1}^{T-d-h+1}$. Then we can write (5.84) as
$$Y_{\tau_j+d} = \begin{cases} \phi_0^{(1)} + \sum_{i=1}^{p}\phi_i^{(1)}Y_{\tau_j+d-i} + \varepsilon_{\tau_j+d}, & (j = n_{\min}+1, \ldots, s), \\ \phi_0^{(2)} + \sum_{i=1}^{p}\phi_i^{(2)}Y_{\tau_j+d-i} + \varepsilon_{\tau_j+d}, & (j = s + 1, \ldots, T - d - h + 1). \end{cases} \tag{5.85}$$
The predictive residuals $\hat\varepsilon_{\tau_{m+1}+d}$ and standardized predictive residuals $\hat e_{\tau_{m+1}+d}$ are given by
The LS estimates for the coefficients $\phi_u^{(1)}$ $(u = 1, \ldots, p)$ are consistent if there is a large number of observations in the first regime. Moreover, the predictive residuals are asymptotically WN and independent of the regressors. When, however, $j$ arrives at and exceeds $s$, the predictive residuals for the observation with index $\tau_{s+1} + d$ will become biased as a result of the model change at time $\tau_{s+1} + d$, and the predictive residuals now become a function of the regressors $\{Y_{\tau_j+d-i};\ i = 1, \ldots, p\}$. That is to say, the independence between the predictive residuals and the regressors is destroyed once the arranged autoregression includes observations whose threshold value exceeds $r$. In other words, there is a change at an unknown time-point in the cumulative sums of the standardized predictive residuals. This calls for a test statistic having its roots in the analysis of change-points. Typically, the first test statistic discussed below uses the change-point framework. The mechanics of the next two test statistics are based on the properties of the one-step ahead predictive residuals.
(ii) Then, for $n_{\min} \le r \le T - p$, find the recursive LS estimates; compute the standardized predictive residuals $\hat e_{\tau_j+d}$ $(j = n_{\min}+1, \ldots, T - d - h + 1;\ h = \max\{1, p + 1 - d\})$.
(iii) Compute the cumulative sums $Z_j = \sum_{i=n_{\min}+1}^{j}\hat e_i$, $(j = n_{\min}+1, \ldots, T - d - h + 1)$, and the associated CUSUM test statistic
$$Q_T = \max_{n_{\min}+1\le j\le T-d-h+1}|Z_j|/\sqrt{T^*}, \tag{5.90}$$
(iii) (Cont'd) where $T^* = T - d - h + 1 - n_{\min}$. Clearly, this is a Kolmogorov–Smirnov type statistic. Under mild conditions on the noise process $\{\varepsilon_t\}$, it follows (MacNeill, 1971) that the limiting distribution of $Q_T$ is given by
$$P(Q_T \le \alpha) = \Delta_\alpha \equiv \sum_{j=-\infty}^{\infty}(-1)^j\big\{\Phi\big(\alpha(2j+1)\big) - \Phi\big(\alpha(2j-1)\big)\big\}, \tag{5.91}$$
where $\Phi(\cdot)$ is the normal distribution function, and $\alpha$ the nominal significance level.
(iv) Some upper quantiles are 0.2309 (90%), 0.3011 (92.5%), 0.3245 (95%), 0.3478 (97.5%), and 0.3616 (99%); see Grenander and Rosenblatt (1984, Chapter 6, Table 1) for a partial tabulation. If $Q_T > \Delta_\alpha$, then we reject the null hypothesis of linearity.
It is fairly obvious that the CUSUM test statistic is very simple to implement, since it does not require the estimation of the SETAR model under the alternative hypothesis (a small R sketch is given below). The test statistic can be used to determine both the number and the location of the thresholds. To avoid underfitting, it is recommended to iterate the recursive LS estimation procedure over different pairs $(d, p)$.
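As mentioned above, the arranged-autoregression CUSUM is simple to sketch in R. The following hedged implementation refits an expanding-window regression with `lm()` (slow but transparent; recursive LS or the Kalman filter would be the efficient route) and, for brevity, standardizes by the running residual standard deviation rather than the exact one-step predictive standard error.

```r
arranged_cusum <- function(y, p, d, nmin = 20) {
  X   <- embed(y, max(p, d) + 1)
  yt  <- X[, 1]; lags <- X[, 1 + (1:p), drop = FALSE]
  ord <- order(X[, 1 + d])                 # arrange cases by Y_{t-d}
  yt  <- yt[ord]; lags <- lags[ord, , drop = FALSE]
  e   <- numeric(0)
  for (j in (nmin + 1):length(yt)) {
    fit  <- lm(yt[1:(j - 1)] ~ lags[1:(j - 1), , drop = FALSE])
    pred <- sum(c(1, lags[j, ]) * coef(fit))
    s    <- sqrt(sum(residuals(fit)^2) / (j - p - 2))
    e    <- c(e, (yt[j] - pred) / s)       # approximate standardization
  }
  max(abs(cumsum(e))) / sqrt(length(e))    # Q_T of (5.90)
}
```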
(ii) Compute a second regression of the predictive residuals on $Y_{\tau_j+d-i}$; i.e.
$$\hat e_{\tau_j+d} = \beta_0 + \sum_{i=1}^{p}\beta_iY_{\tau_j+d-i} + \omega_{\tau_j+d}, \quad (j = n_{\min}+1, \ldots, T - d - h + 1).$$
Simulation studies show that the TAR F test statistic has consistently higher em-
pirical power than the portmanteau CUSUM test statistic.
(iii) Regress $\hat\varepsilon_t$ on $\{1, Y_{t-1}, \ldots, Y_{t-p}\}$, $\{Y_{t-i}\hat\varepsilon_{t-i}, \hat\varepsilon_{t-i}\hat\varepsilon_{t-i-1}\}$ $(i = 1, \ldots, p)$, and $\{Y_{t-1}\exp(-\gamma Y_{t-1}), \widehat\Phi(z_{t-d}), Y_{t-1}\widehat\Phi(Y_{t-d})\}$, where $z_t = (Y_{t-d} - \bar Y_d)/s_d$ with $\bar Y_d$, $s_d$ the sample mean and standard deviation of the $Y_{t-d}$, respectively. Calculate the residual sum of squares from this regression, $SSE_1 = \sum_t\hat\omega_t^2$.
$$F_T^{(N)} \xrightarrow{D} F_{\nu_1,\nu_2},$$
where $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\varepsilon^2)$. Thus, the adequacy of the model under $H_0$ is tested versus
Hence, if $H_0$ is true, the difference between the two residual sums of squares should be small if $T$ is sufficiently large, and $\sum_t\hat\varepsilon_{1,t}^2$ should be small. On the other hand, if $H_a$ is true, $\sum_t\hat\varepsilon_{1,t}^2$ should be large while $\sum_t\hat\omega_t^2$ should be small.
that the actual DGP is a member of that family, but does not specify which one.
This latter situation occurs when artificial, or surrogate,4 data are created with MC
simulation methods. Surrogate data sets are often used in studies of nonlinear dy-
namical systems; see, e.g., Theiler et al. (1992), and Theiler and Prichard (1996) for
further insights into this topic.
⁴ Surrogate data have no dynamical nonlinearities. By construction a surrogate is equivalent to passing i.i.d. Gaussian WN through a linear filter that reproduces the linear properties of one realization of the strictly stationary process $\{Y_t, t \in \mathbb{Z}\}$.
Section 5.2: Asymptotic critical values of the LR test statistic for SETMA(2; q, q) models with d > q are the same as those of the test statistics for change-points in Andrews (1993).
Empirical implementations of the LR testing approach are reported by K.S. Chan and Tong
(1986). Ling and Tong (2005) suggest a computationally intensive bootstrap method to
calculate p-values of a quasi-LR test for SETMA(2; q, q) models with d < q. Li and Li
(2008) generalize the test in Ling and Tong (2005) to a quasi-LR test statistic for TMA
models with GARCH errors.
Hansen (2000) recommends inverting the LR test statistic to construct confidence intervals
for the threshold parameter of a SETAR process. If the error process in (5.61) is conditionally
(1,i)
heteroskedastic, it is necessary to replace the FT test statistic with a heteroskedasticity-
consistent Wald or LM-type test statistic; Hansen (1997).
Chen et al. (2012b) propose an LR test statistic to determine the number of regimes in SETAR models with two regimes.
Section 5.3: The Wald test statistic for symmetry of ARasMA models is due to Brännäs
and De Gooijer (1994). For asMA(1) models, the size properties are best for the LM-type
test statistic followed by, in order, the Wald and LR test statistics. The latter two tests are
more powerful than the LM-type test statistic; see also Brännäs et al. (1998).
Testing for a linear (near) unit root against (stationary) TAR models is the topic of a large
number of papers in the econometrics literature. For instance, Caner and Hansen (2001)
propose a Wald statistic for testing a two-regime SETAR with stationary but unknown
threshold parameter, Enders and Granger (1998) focus on an F test statistic for an M–TAR
model with known threshold parameter, Lanne and Saikkonen (2002) introduce a stability
test statistic for a TAR model with threshold effects only in the intercept term, Kapetanios
and Shin (2006) consider a Wald statistic for testing a three-regime SETAR model with a
random walk in the middle regime. Pitarakis (2008) comments on the limiting distribution
of the Wald test statistic in Caner and Hansen (2001). Bec et al. (2008) propose a SupWald
test statistic for SETARs with an adaptive set of thresholds, and Seo (2008) considers a
residual-based block bootstrap algorithm for testing the null hypothesis of a unit root in
SETARs.
Charemza et al. (2005) introduce a Student t-type test statistic for detecting unit root
bilinearity in a simple BL(1, 0, 1, 1) process. The linearity coefficient in this model may
be estimated by the Kalman filter algorithm, following an approach suggested by Hristova
(2005).
Section 5.4: The RESET test statistic of Ramsey (1969) may be viewed as an earlier, and
more general, version of the Tukey nonadditivity-type test statistic.
Section 5.5: It is easy to verify that (5.91) is identical to the approximate large sample
distribution given by Petruccelli and Davies (1986). Petruccelli (1990) introduces another
CUSUM test statistic for linearity using the reversed predictive residuals, denoted by $Q_T^{rev}$
in Table 5.2. Similarly, Sorour and Tong (1993) examine the performance of the LR test
statistic for SETAR and the CUSUM test statistics in building a TARSO model.
Tong and Yeung (1990, 1991b) apply the CUSUM tests (original and reversed) and the TAR
F test statistic to investigate nonlinearities in partially observed time series; see also Tsai
and Chan (2000, 2002).
Following the basic structure of Algorithm 5.10, Liang et al. (2015) propose an F -type test
statistic for testing linear MA models versus (rearranged) SETMA models. The procedure
190 5 TIME-DOMAIN LINEARITY TESTS
requires the subjective use of scatter plots to identify the number and locations of potential
threshold values. The MA order follows from inspection of the sample ACF.
Section 5.6: Many studies have been performed investigating power properties of the test
statistics considered in this chapter. Important contributions published prior to the year
1992 are summarized in the review paper by De Gooijer and Kumar (1992, Exhibit 1).
Teräsvirta et al. (1993) study and compare the power of LM-type and ANN test statistics
(see also Lee et al., 1993). de Lima (1997) investigates the robustness of several portmanteau-
type nonlinearity test statistics (e.g. Hinich’s bispectrum test) to moment condition failure.
More recently, Vavra (2013, Chapter 2) examines the robustness of eight nonlinearity test
statistics against non-Gaussian innovations by MC simulation. Overall, there is no clear link
between the performance of the test statistics and their moments requirements. However,
some of the test statistics are not very trustworthy for DGPs with heavy-tailed innovations.
⁵ RATS, also called WinRATS, is a registered trademark of Estima, Inc.
Appendix 5.A
$$P\Big(\sup_{0\le t\le t^*}|U_t| > z\Big) \sim (2/\pi)^{1/2}\exp(-z^2/2)\Big(t^*z - \frac{t^*}{z} + \frac{1}{z}\Big), \tag{5.96}$$
where
$$t^* = \frac{1}{2}\log\frac{b(1-a)}{a(1-b)}, \quad (0 < a < b < 1),$$
and $\{U_t\}$ is a so-called stationary Ornstein–Uhlenbeck process with $\mathrm{E}(U_t) = 0$ and $\mathrm{E}(U_sU_t) = \exp(-|t - s|)$.
Tables 1 and 2 in Chan (1991) contain upper 10%, 5%, 2.5%, 1% and 0.1% percentage points for the null distribution of the $LR_T^{(8)}$ test statistic for $0 \le p \le 18$ and $(a, b) = (0.25, 0.75)$ and $(0.1, 0.9)$. For $p = 0$, it can be seen that the percentage points are close to those of a $\chi^2_3$ distribution, which also follows from comparing (5.96) with the asymptotic distribution function $P(\chi^2_3 > z^2) \sim (2/\pi)^{1/2}\exp(-z^2/2)\big(z + \frac{1}{z}\big)$.
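A quick numerical check of this tail approximation is instructive (our own illustration):

```r
z <- c(2, 3, 4)
exact  <- pchisq(z^2, df = 3, lower.tail = FALSE)    # P(chi^2_3 > z^2)
approx <- sqrt(2 / pi) * exp(-z^2 / 2) * (z + 1 / z)
round(cbind(z, exact, approx), 5)
```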
from the sample mean, is not recommended since then the asymptotic null distributions are
no longer valid.
Some additional remarks are in order:
(i) With the test statistics Q_T, Q_T^{rev} and LR_T one must fix p and d. The selection of the order p can be done via, e.g., AIC. Also, the number of thresholds needs to be pre-specified.
(ii) The selection of the added variables with many of the LM-type and F -type test
statistics is somewhat arbitrary. For example, one uses p added variables specifically
for the ExpAR(p) model and p + 1 for the STAR(2; p, p) model.
(iii) Test statistics based on the recursive LS method require a minimum number of observations, n_min, to start the recursions. However, n_min depends on the order p and the sample size T.
(iv) The recursive estimation can be done via various algorithms such as the one given by
(5.86) – (5.87), or by the Kalman filter. The latter method appears to be preferable
when there are missing observations in the data.
(v) The empirical power studies in Table 5.2 have been carried out under a wide variety of alternatives (see the footnotes at the bottom of the table). No fixed set of DGPs has been used across all studies with the same sample size, so comparison of the reported results is difficult. Moreover, power studies are criticized for the fact that test results are determined by the sample size, i.e. as T increases the empirical power goes to one under the alternative hypothesis. In contrast, local alternatives make their difference from the null hypothesis shrink as T increases. Only a few papers investigate the local power of linearity tests; see, e.g., Guégan and Pham (1992) for the LM-type test statistic against a general diagonal BL model.
Table 5.2: Summary of size and power studies for some time-domain linearity test statistics; equation numbers in parentheses refer to the particular test statistic in the main text.

BL(1):
(i) Q_T (5.90); T = 50, 100: marginally outperforms F_T^{(T)} [Petruccelli and Davies (1986)].
    F_T^{(T)} (5.81); T < 200: reasonable only for extreme BL-DGPs; T > 200: good for a wide range of BL-DGPs [Davies and Petruccelli (1986)].
(ii) F_T^{(T)} (5.81); T = 50, 100, 200: good; LM_T^{(1)} (5.12) outperforms F_T^{(T)} [Saikkonen and Luukkonen (1988)].
(iii) LM_T^{(1)} (5.12); T = 50, 75, 100, 150: good for BL-DGPs [Saikkonen and Luukkonen (1991)].
(iv) F_T^{(O)} (5.82); T = 70, 140, 204: outperforms F_T^{(T)} [Tsay (1986)].
(v) F_T^{(O)} (5.82), F_T^{(N)} (5.93), F_T^{(A)} (5.83); T = 100: all tests have good power [Tsay (1991)].

ExpAR(2):
Q_T (5.90), F_T^{(N)} (5.93); T = 100: good; F_T* (5.92): not powerful [Tsay (1991)].
LM_T^{(2)} (5.15); T = 50, 100, 200: outperforms F_T^{(T)} and LM_T^{(1)} [Saikkonen and Luukkonen (1988)].

SETAR(3):
(i) Q_T (5.90); T = 50, 100: less powerful than F_T^{(T)} [Petruccelli and Davies (1986)].
(ii) Q_T (5.90); T = 50, 100, 150, 200, 250: less powerful than Q_T^{rev}; F_T^{(8)} (5.60); T = 50, 100: outperforms Q_T and Q_T^{rev} [Moeanaddin and Tong (1988)].
(iii) Q_T^{rev} and LM_T^{(3**)} (5.26); T = 100: outperform F_T^{(8)} and F_T* [Petruccelli (1990)].
(iv) F_T^{(T)} (5.81); T < 100: reasonable only for nearly nonstationary DGPs; T > 100: more satisfactory [Davies and Petruccelli (1986)].
(v) F_T* (5.92); T = 50, 100: outperforms Q_T [Tsay (1989)].
(vi) LM_T^{(3*)} (5.22), LM_T^{(3**)} (5.26), LM_T^{(4)} (5.29); T = 50, 100: LM_T^{(4)} is more powerful; LM_T^{(3*)} and Q_T are poor [Luukkonen et al. (1988b)].

LSTAR(4):
(i) F_T^{(O)} (5.82), F_T^{(A)} (5.83), Q_T (5.90), F_T^{(N)} (5.93); T = 100: all tests have low power [Tsay (1991)].
(ii) LM_T^{(3*)} (5.22), LM_T^{(3**)} (5.26), LM_T^{(4)} (5.29); T = 50, 100: LM_T^{(3)} is inferior to LM_T^{(3*)} and LM_T^{(4)}; LM_T^{(3)} and Q_T have low power [Luukkonen et al. (1988a)].

(1) (i) Y_t = (φ + ψε_t)Y_{t−1} + ε_t; (ii) model (2.13); (iii) Y_t = μ + ψε_{t−1}Y_{t−i} + ε_t (i = 1, 2); (iv) Y_t = ε_t − 0.4ε_{t−1} + 0.3ε_{t−2} + 0.5ε_tε_{t−2}; (v) Y_t = 0.5Y_{t−1} + ψY_{t−1}ε_{t−1} + ε_t and Y_t = ε_t + 0.5ε_{t−1} + ψε²_{t−1}.
(2) Y_t = {φ + ξ exp(−Y²_{t−1})}Y_{t−1} + ε_t.
(3) (i) SETAR(2; 1, 1) (no intercept); (ii) SETAR(2; 1, 1) (no intercept); (iii) SETAR(2; 1, 1), SETAR(2; 3, 2) and SETAR(3; 1, 1, 1) (all with intercept); (iv) SETAR(2; 1, 1) (no intercept); (v) SETAR(2; 1, 1) (no intercept); (vi) SETAR(2; 1, 1) (with intercept).
(4) (i) Y_t = 1 − (1/2)Y_{t−1} + (φ + ξY_{t−1})G(γY_{t−1}) + ε_t with G(z) = 1/(1 + exp(−z)); (ii) Y_t = −(1/2)Y_{t−2} − φY_{t−2}G((1/2)Y_{t−1}) + ε_t with G(z) = 1/(1 + exp(−z)).
Exercises
Theory Questions
5.1 Let γ_Y^{(1,2)}(ℓ) = Cov(Y_t, Y²_{t−ℓ}) denote the bicovariance at lag ℓ of a time series {Y_t, t ∈ Z} generated by an MA(ℓ) model with mean E(Y_t) = 0, and with {ε_t} ~ i.i.d. N(0, σ_ε²). Given an observed time series {Y_t}_{t=1}^{T}, the moment estimator of γ_Y^{(1,2)}(ℓ) equals γ̂_Y^{(1,2)}(ℓ) = (T − ℓ)^{−1} Σ_{t=ℓ+1}^{T} Y_t Y²_{t−ℓ}. Under the null hypothesis H_0: γ_Y^{(1,2)}(ℓ) = 0 (ℓ = 1, 2, . . .), Welsh and Jernigan (1983) show that, as T → ∞, the large sample distribution of the standardized bicovariance is given by

WJ = Σ_{t=ℓ+1}^{T} Y_t Y²_{t−ℓ} / √(3(T − ℓ)) →_D N(0, 1).

Show that the WJ test statistic is a special case of the LM-type test statistic for testing an MA(k) model against an ASTMA(k) model.
5.2 Suppose that the T × 1 vector of observations y = (Y_1, . . . , Y_T)′ satisfies the asAR(p) model

Y_t = Σ_{i=1}^{p} { φ_i + α_i I(ε_{t−i} ≥ 0) } Y_{t−i} + ε_t,   {ε_t} ~ i.i.d. N(0, σ_ε²).

H*_0: α = 0 and β = 0.
Simulation Question
5.4 In this exercise we evaluate by simulation the power of the F_T^{(1,2)} test statistic, defined by (5.63), under model-selection uncertainty. The SETAR(2; 2, 2) model for the observed time series is formulated as

Y_t = φ_0^{(1)} + φ_1^{(1)} Y_{t−1} + φ_2^{(1)} Y_{t−2} + ε_t   if Y_{t−2} ≤ 0,
Y_t = φ_0^{(2)} + φ_1^{(2)} Y_{t−1} + φ_2^{(2)} Y_{t−2} + ε_t   if Y_{t−2} > 0,

where {ε_t} ~ i.i.d. N(0, 1). Consider the following two DGPs:

(i) φ_0^{(1)} = 0.5, φ_1^{(1)} = −φ_1^{(2)} = 0.2, φ_0^{(2)} = 0.3, φ_2^{(1)} = −φ_2^{(2)} = −0.1; and
(ii) φ_0^{(1)} = 0.5, φ_1^{(1)} = −φ_1^{(2)} = −0.1, φ_2^{(1)} = −φ_2^{(2)} = 0.1.

(a) For T = 200 and 500, generate 2,000 MC replications of the DGPs (i) and (ii). Next, compute the empirical power of the F_T^{(1,2)} test statistic, at the 5% nominal significance level, using (i) a correctly specified SETAR model (setting the true lag length at two), and (ii) the AIC and BIC order selection criteria (setting the maximum allowed lag order p_max = 6). You should find the results given in Table 5.3 (approximately).

• Then select the AR(p) model if min_p AIC(p) < min_{p,r,d} SC(p, d; r) (1 ≤ p ≤ p_max, r ∈ R, 1 ≤ d ≤ p). A similar approach can be based on BIC with C(T) = log T.

For T = 200 and 500, generate 1,000 replications of the DGPs (i) and (ii). Next, apply the above two model-selection approaches (AIC and BIC) and record the number of correct decision frequencies. Table 5.4 provides a summary of the results you will find.

Compare and contrast the results in Tables 5.4 and 5.3.
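A minimal Python sketch of part (a) for DGP (i) follows. Since (5.63) is not reproduced here, the statistic below is a simplified stand-in (an F-type comparison of a linear AR against a SETAR with known threshold and delay), so the resulting rejection frequencies will only roughly track Table 5.3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)

def setar_sim(T, c1, c2, burn=100):
    # SETAR(2; 2, 2) of Exercise 5.4: regime switches on the sign of Y_{t-2}
    y = np.zeros(T + burn)
    e = rng.standard_normal(T + burn)
    for t in range(2, T + burn):
        c = c1 if y[t - 2] <= 0.0 else c2
        y[t] = c[0] + c[1] * y[t - 1] + c[2] * y[t - 2] + e[t]
    return y[burn:]

def rss(X, Y):
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    u = Y - X @ beta
    return u @ u

def f_test(y, p=2, d=2, r=0.0):
    # simplified F-type comparison: linear AR(p) vs. SETAR with known (r, d)
    T = len(y)
    Y = y[p:]
    X = np.column_stack([np.ones(T - p)] + [y[p - i:T - i] for i in range(1, p + 1)])
    I1 = (y[p - d:T - d] <= r)[:, None]
    X2 = np.hstack([X * I1, X * ~I1])
    k = p + 1
    F = ((rss(X, Y) - rss(X2, Y)) / k) / (rss(X2, Y) / (len(Y) - 2 * k))
    return 1.0 - stats.f.cdf(F, k, len(Y) - 2 * k)   # p-value

c1, c2 = (0.5, 0.2, -0.1), (0.3, -0.2, 0.1)          # DGP (i)
for T in (200, 500):
    power = np.mean([f_test(setar_sim(T, c1, c2)) < 0.05 for _ in range(2000)])
    print(T, power)
```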
Chapter 6
MODEL ESTIMATION, SELECTION, AND
CHECKING
Model estimation, selection, and diagnostic checking are three interwoven compon-
ents of time series analysis. If, within a specified class of nonlinear models, a par-
ticular linearity test statistic indicates that the DGP underlying an observed time
series is indeed a nonlinear process, one would ideally like to be able to select the
correct lag structure and estimate the parameters of the model. In addition, one
would like to know the asymptotic properties of the estimators in order to make
statistical inference. Moreover, it is evident that a good, perhaps automatic, order
selection procedure (or criterion) helps to identify the most appropriate model for
the purpose at hand. Finally, it is common practice to test the series of standardized
residuals for white noise via a residual-based diagnostic test statistic.
In this chapter, we focus on these three themes within the context of parametric
nonlinear modeling. Specifically, we consider the class of identifiable parametric
stochastic models

Y_t = μ_t(θ_g) + σ_t(θ_h) ε_t,

where μ_t(θ_g) = g(Y_{t−1}, Y_{t−2}, . . . ; θ_g) denotes the conditional mean and σ_t(θ_h) = h^{1/2}(Y_{t−1}, Y_{t−2}, . . . ; θ_h) the conditional standard deviation, with θ = (θ_g′, θ_h′)′.
Assume that {ε_t} has density function f_ε(·). Given Y_0, the (conditional) likelihood function evaluated at θ ∈ Θ is equal to

L_T(θ) = Π_{t=1}^{T} σ_t(θ_h)^{−1} f_ε( (Y_t − μ_t(θ_g)) / σ_t(θ_h) ),

assuming σ_t(θ_h) ≠ 0.
The above objective function is not operational because fε (·) and Y0 are gen-
erally unknown. The initial values can be replaced by some fixed constants, e.g.,
zeros. More generally, one can treat Y0 and ε0 as unknown, additional, parameter
vectors and estimate them jointly with other parameters. This approach requires
more intensive computation. In finite samples, it may result in different parameter
estimates, but it will not affect the asymptotic properties of the estimator of θ 0 .
Replacing f_ε(·) by the N(0, 1) density function, and approximating μ_t(θ_g) by μ̂_t(θ_g) = g(Y_{t−1}, . . . , Y_1, 0, . . . ; θ_g) and σ_t(θ_h) by σ̂_t(θ_h) = h^{1/2}(Y_{t−1}, . . . , Y_1, 0, . . . ; θ_h), the resulting estimator θ̂_T is called the quasi ML (QML) estimator of θ_0. That is,

θ̂_T = arg min_{θ∈Θ} Q̂_T(θ),   (6.2)

where

Q̂_T(θ) = (1/T) Σ_{t=1}^{T} ℓ̂_t,   with   ℓ̂_t ≡ ℓ̂_t(θ) = { (Y_t − μ̂_t(θ_g)) / σ̂_t(θ_h) }² + log σ̂_t²(θ_h)

the log-likelihood contribution (up to a constant and sign) at time t. Furthermore, if σ̂_t²(θ_h) ≡ σ_0² > 0, i.e. a constant, the QML estimator coincides with the classical NLS estimator.
It is known that a solution to (6.2) exists when the parameter space Θ is compact,
and the functions θ g → μ t (θ g ) and θ h → σt (θ h ) are continuous. Moreover, under
some regularity conditions, it follows that the QML estimator is strongly consistent and asymptotically normally distributed; see, e.g., Tjøstheim (1986b). More precisely,
with ℓ_t(θ) = { Y_t − μ_t(θ_g) }² σ_t^{−2}(θ_h) + log σ_t²(θ_h), and as T → ∞,

√T (θ̂_T − θ_0) →_D N( 0, H^{−1}(θ_0) I(θ_0) H^{−1}(θ_0) ),   (6.3)

where

H(θ_0) = E[ ∂²ℓ_t(θ_0)/∂θ∂θ′ ],   I(θ_0) = E[ { ∂ℓ_t(θ_0)/∂θ }{ ∂ℓ_t(θ_0)/∂θ′ } ].

Here H(·) denotes the expected Hessian matrix, and I(·) is the expected information matrix with ℓ_t(·) evaluated at θ_0.
Consistent estimates of the standard errors of the QML estimator θ̂_T are obtained as the square roots of the diagonal elements of the estimated covariance matrix of θ̂_T, that is,

V̂ar(θ̂_T) = (1/T) Ĥ_T^{−1} Î_T Ĥ_T^{−1},

where the empirical Hessian and average information matrix for a sample of size T are defined as, respectively,

Ĥ_T = (1/T) Σ_{t=1}^{T} ∂²ℓ_t(θ̂_T)/∂θ∂θ′,   Î_T = (1/T) Σ_{t=1}^{T} { ∂ℓ_t(θ̂_T)/∂θ }{ ∂ℓ_t(θ̂_T)/∂θ′ }.   (6.4)
Further, let the gradient (score) vector be

G(θ) = Σ_{t=1}^{T} ∂ℓ_t(θ)/∂θ.
In practice, it is usually not possible to obtain an analytic solution for θ̂_T, es-
pecially when the objective function involves many parameters. In such a situation,
estimates of θ 0 must be sought numerically using nonlinear optimization algorithms.
The basic idea of nonlinear optimization is to quickly find optimal parameters that
maximize the log-likelihood. This is done by searching much smaller sub-sets of the
multi-dimensional parameter space rather than exhaustively searching the whole
parameter space, which becomes intractable as the number of parameters increases.
Numerical optimization algorithms often involve the following steps.
(iii) Taking into account the results from step (ii), obtain a new set of estimates θ̂_{T,i} (i = 2, 3, . . .) by adding small changes to the previous estimates in such a way that the new parameter estimates are likely to lead to improved performance.

(iv) Stop the iterative process in step (iii) if the parameter estimates are judged to have converged, using an appropriately predefined criterion. For instance, stop if the relative improvement {Q̂(θ̂_{T,i+1}) − Q̂(θ̂_{T,i})}/Q̂(θ̂_{T,i}) falls below a small prefixed number.
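As a minimal illustration of steps (iii) and (iv), and of the multiple-starting-values device discussed next, the sketch below minimizes Q̂_T(θ) for a deliberately simple placeholder model (AR(1) mean, constant variance); the model and all names are illustrative assumptions, not the book's.

```python
import numpy as np
from scipy.optimize import minimize

def qml_objective(theta, y):
    # Q_T(theta) = T^{-1} sum[ {(Y_t - mu_t)/sigma_t}^2 + log sigma_t^2 ];
    # placeholder model: mu_t = a + b*Y_{t-1}, sigma_t^2 = exp(c) constant.
    a, b, c = theta
    mu = a + b * y[:-1]
    sig2 = np.exp(c)                    # log-parameterization keeps sigma^2 > 0
    return np.mean((y[1:] - mu) ** 2 / sig2 + np.log(sig2))

def fit_qml(y, n_starts=10, seed=0):
    # iterate from several starting values and keep the best run
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        res = minimize(qml_objective, rng.normal(scale=0.5, size=3),
                       args=(y,), method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best
```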
It is worth noting that the optimization algorithm does not necessarily guarantee that the final estimate θ̂_T uniquely maximizes the log-likelihood. Even if G(θ̂_T) ≈ 0, the algorithm can prematurely stop and return a sub-optimal set of parameter values.
This is called the local maxima problem. Unfortunately, there exists no general
solution to the local maximum problem. Instead, a variety of remedies have been
developed in an attempt to avoid the problem (see, e.g., Teräsvirta et al., 2010,
Chapter 12), though there is no guarantee of their effectiveness. For example, one
may choose different starting values over multiple runs of the iteration procedure and
then examine the results to see whether the same solution is obtained repeatedly.
When that happens, one can conclude with some confidence that θ T is close to a
global optimum. If, however, the changes in the parameter estimates remain large
in multiple iterations the parameters of the model may not be identified.
To assess the performance of the QML estimator of θ_0 in finite samples, the next example presents a simulation experiment.
Y = Xβ + ε,

Y_t = { φ + ξ exp(−γ Y²_{t−1}) } Y_{t−1} + ε_t.

Consider model (6.5) with φ = −0.8, ξ = 2, γ = 2, and {ε_t} ~ i.i.d. N(0, 1).
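A minimal Python sketch of such a simulation experiment, assuming model (6.5) is the ExpAR(1) displayed above with additive noise ε_t, is given below; the bias of the estimates could be assessed by repeating the fit over many replications.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

def expar_sim(T, phi=-0.8, xi=2.0, gamma=2.0, burn=200):
    # simulate Y_t = {phi + xi*exp(-gamma*Y_{t-1}^2)} Y_{t-1} + e_t
    y = np.zeros(T + burn)
    e = rng.standard_normal(T + burn)
    for t in range(1, T + burn):
        y[t] = (phi + xi * np.exp(-gamma * y[t - 1] ** 2)) * y[t - 1] + e[t]
    return y[burn:]

def expar_resid(theta, y):
    phi, xi, gamma = theta
    yl = y[:-1]
    return y[1:] - (phi + xi * np.exp(-gamma * yl ** 2)) * yl

y = expar_sim(500)
fit = least_squares(expar_resid, x0=(-0.5, 1.0, 1.0), args=(y,))
print(fit.x)   # CLS/QML estimates of (phi, xi, gamma)
```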
all white noise variances being equal. The latter model is defined as
Y_t = φ_0^{(1)} + Σ_{i=1}^{p_1} φ_i^{(1)} Y_{t−i} + ε_t + Σ_{j=1}^{q_1} ψ_j^{(1)} ε_{t−j}   if Y_{t−d} ≤ r,
Y_t = φ_0^{(2)} + Σ_{i=1}^{p_2} φ_i^{(2)} Y_{t−i} + ε_t + Σ_{j=1}^{q_2} ψ_j^{(2)} ε_{t−j}   if Y_{t−d} > r.   (6.8)
L_T(θ) = Σ_{t=1}^{T} ε_t²(θ),   (6.9)

where

ε_t(θ) = Y_t − ( φ_0^{(1)} + Σ_{i=1}^{p_1} φ_i^{(1)} Y_{t−i} + Σ_{j=1}^{q_1} ψ_j^{(1)} ε_{t−j}(θ) ) I(Y_{t−d} ≤ r)
           − ( φ_0^{(2)} + Σ_{i=1}^{p_2} φ_i^{(2)} Y_{t−i} + Σ_{j=1}^{q_2} ψ_j^{(2)} ε_{t−j}(θ) ) I(Y_{t−d} > r).
The CLS estimator θ̂_T = (τ̂_T′, r̂_T, d̂_T)′ of θ_0 is the value which globally minimizes (6.9), that is,

θ̂_T = arg min_θ L_T(θ).   (6.10)
In practice, the vector of initial values Y0 is not available and can be replaced by
constants. This will not affect the asymptotic properties of θ T . For simplicity, we
assume hereafter that Y0 is from model (6.8). Since LT (θ) is discontinuous in r and
d, the minimization in (6.10) can be done as follows.
(ii) Since L*_T(r, d) takes only finitely many values, perform a grid search over the set of order statistics {Y_{(1)}, . . . , Y_{(T)}} of {Y_1, . . . , Y_T} and over {1, . . . , D_0} to get the minimizer (r̂_T, d̂_T) of L*_T(r, d).

(iii) Use a plug-in method to obtain τ̂_T(r̂_T, d̂_T) and θ̂_T.
Generally, there are infinitely many values of r at which L_T(·) attains its global minimum; the one with the smallest r can be chosen as the estimator of r_0. It is easy to see that θ̂_T is the CLS estimator of θ_0. For instance, with a SETAR(2; p, p) model, simple computation shows that for a given value of r the CLS estimator of θ_0 is given by
θ̂_T(r) = ( Σ_{t=1}^{T} X_t(r) X_t′(r) )^{−1} Σ_{t=1}^{T} X_t(r) Y_t,   (6.11)

where X_t(r) = (X_t′ I(Y_{t−d} ≤ r), X_t′ I(Y_{t−d} > r))′ with X_t = (1, Y_{t−1}, . . . , Y_{t−p})′. With residuals ε̂_t(r) = Y_t − X_t′(r) θ̂_T(r), the corresponding (conditional) residual variance is given by σ̂²_T(r) = T^{−1} Σ_{t=1}^{T} ε̂_t²(r).
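As an illustration of (6.11) and the threshold grid search, the following minimal sketch (our simplification: fixed delay d, common order p, trimmed threshold range) profiles the residual variance σ̂²_T(r) over candidate thresholds and keeps the smallest minimizing r.

```python
import numpy as np

def setar_cls(y, p=1, d=1, trim=0.15):
    # CLS for a SETAR(2; p, p): profile sigma_hat^2_T(r) of (6.11) over
    # candidate thresholds taken from the order statistics of {Y_t}.
    T = len(y)
    start = max(p, d)
    Y = y[start:]
    X = np.column_stack([np.ones(T - start)] +
                        [y[start - i:T - i] for i in range(1, p + 1)])
    thr_var = y[start - d:T - d]                      # threshold variable Y_{t-d}
    lo, hi = np.quantile(thr_var, [trim, 1 - trim])
    best = (np.inf, None, None)
    for r in np.sort(thr_var[(thr_var >= lo) & (thr_var <= hi)]):
        I1 = (thr_var <= r)[:, None]
        Xr = np.hstack([X * I1, X * ~I1])             # X_t(r) of (6.11)
        beta, *_ = np.linalg.lstsq(Xr, Y, rcond=None)
        s2 = np.mean((Y - Xr @ beta) ** 2)            # residual variance
        if s2 < best[0]:                              # strict "<" keeps smallest r
            best = (s2, r, beta)
    return best   # (residual variance, r_hat, CLS coefficients)
```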
distributions of r̂_T (a super-consistent estimator) and θ̂_T; and (c) the convergence rate of T(r̂_T − r_0). A rigorous treatment of the conditions under which these authors prove the above issues is beyond the scope of this book. However, in case of (c), we introduce some notation to discuss the numerical method for tabulating the limiting distribution of r̂_T.
Consider the profile sum-of-squared-errors function

L̃_T(z) = L_T( τ̂_T(r_0 + z/T), r_0 + z/T ) − L_T( τ̂_T(r_0), r_0 ),   z ∈ R.
Let e = (1, 0, . . . , 0)′ be a q × 1 vector, and

H_{t,j}(θ) = Π_{i=1}^{j} [ Ψ_2 + (Ψ_1 − Ψ_2) I(Y_{t−d−i+1} ≤ r) ],   (j ≥ 0),

with the convention Π_{i=1}^{0} = I_q, and

Ψ_i = ( −ψ_1^{(i)}  ···  −ψ_q^{(i)}
         I_{q−1}     0_{(q−1)×1} ),   (i = 1, 2).
where

ζ_t^{(i)} = Σ_{j=0}^{∞} [e′H_{t+j,j}(θ_0)e]² δ_t² + 2(−1)^{i+1} Σ_{j=0}^{∞} ε_{t+j} [e′H_{t+j,j}(θ_0)e] δ_t,   (i = 1, 2),   (6.12)

and

δ_t = (φ_{0,0}^{(1)} − φ_{0,0}^{(2)}) + Σ_{i=1}^{p} (φ_{i,0}^{(1)} − φ_{i,0}^{(2)}) Y_{t−i} + Σ_{i=1}^{q} (ψ_{i,0}^{(1)} − ψ_{i,0}^{(2)}) ε_{t−i}.
Let F_k(·|r_0) be the conditional distribution of ζ_{d+1}^{(k)} (k = 1, 2) given Y_1 = r_0. To describe the limiting distribution of r̂_T, consider two independent compound Poisson processes (CPPs) {℘^{(1)}(z), z ≥ 0} and {℘^{(2)}(z), z ≥ 0} with ℘^{(1)}(0) = ℘^{(2)}(0) = 0 a.s., with the same jump rate π(r_0) > 0, where π(·) is the pdf of Y_1, and with the jump distributions F_1(·|r_0) and F_2(·|r_0), respectively. Define a two-sided CPP {℘(z), z ∈ R} as follows:

℘(z) = ℘^{(1)}(−z) I(z < 0) + ℘^{(2)}(z) I(z ≥ 0).   (6.13)
Observe that ℘(z) goes to ∞ a.s. when |z| → ∞ since ∫ x dF_k(x|r_0) > 0. Therefore, there exists a unique random interval [M_−, M_+) on which the process (6.13) attains its global minimum and nowhere else. Then, under some mild conditions, it can be proved (Li et al., 2011) that: (i) T(r̂_T − r_0) →_D M_−, as T → ∞; and (ii) T(r̂_T − r_0) is asymptotically independent of √T(τ̂_T − τ_0), and their asymptotic distributions are the same regardless of whether r_0 is known or not. In particular,
(ii) Generate two independent jump-time sequences {U_1, . . . , U_{N_1}} and {V_1, . . . , V_{N_2}}, where {U_i} ~ i.i.d. U[−N, 0] and {V_i} ~ i.i.d. U[0, N].

(iii) Generate two independent jump-size sequences {Y_1, . . . , Y_{N_1}} and {Z_1, . . . , Z_{N_2}} from F_1(·|r_0) and F_2(·|r_0), respectively.

(iv) Create a set of equidistant points over the interval [−N, N]. For z ∈ [−N, N], compute the trajectory of (6.13), i.e., ℘(z) = I(z < 0) Σ_{i=1}^{N_1} I(U_i > z) Y_i + I(z ≥ 0) Σ_{j=1}^{N_2} I(V_j < z) Z_j. Find the smallest minimizer of ℘(z) on [−N, N] and call it M_−^{(b)}.

(v) Repeat step (iv) B times, to obtain {M_−^{(b)}}_{b=1}^{B}.
Algorithm 6.3 depends crucially on step (iii). When θ_0, π(r_0), the distribution F_ε(·) of {ε_t}, and the distribution G_{Z_0}(·) of Z_0 = (Y_0, . . . , Y_{1−(p∨d)}, ε_0, . . . , ε_{1−q})′ are known, the appropriate way to proceed is to first sample {ε_t}_{t=2}^{d+1+L} independently from F_ε(·), where L is some large integer. Next, draw a sample (z_1, . . . , z_K) from G_{Z_0}(·), where K is another large integer, and z_i = (Y_i, . . . , Y_{i−(p∨d)+1}, ε_0, . . . , ε_{1−q})′ ∈ R^{(p∨d)+q} (i = 1, . . . , K). Then, generate {Y_t}_{t=2}^{d+1+L} by iterating model (6.8) with the initial values Y_1 = r_0, Z_0 = z_i, and ε_1 = r_0 − g(z_i, θ_0) (i = 1, . . . , K). Obtain an approximation, say ζ̃_{d+1,k}^{(1)}, of ζ_{d+1}^{(1)} (k = 1, . . . , K) by truncating the infinite sums in (6.12) after L terms. Since [e′H_{d+1+j,j}(θ_0)e]² = O(ρ^j) a.s., the remaining term is negligible when L is large enough. Calculate the conditional
(ii) Sample {ε̂_t}_{t=2}^{d+1+L} independently from F̂_ε(·) given {Y_t}_{t=1}^{T}.

(iii) Generate {Ỹ_t}_{t=2}^{d+1+L} by iterating model (6.8) with the initial values Y_1 = r̂_T, Z_0 = ẑ_i, and ε_1 = r̂_T − g(ẑ_i; θ̂_T). Compute Ĥ_{d+1+j,j}(θ̂_T) = Π_{i=1}^{j} [ Ψ̂_2 + (Ψ̂_1 − Ψ̂_2) I(Ỹ_{i+1} ≤ r̂_T) ] as an estimate of H_{d+1+j,j}(·).

(iv) Compute

ζ̂_{d+1,k}^{(1)} = Σ_{j=0}^{L} [e′Ĥ_{d+1+j,j}(θ̂_T)e]² (δ̂*_{d+1})² + 2 Σ_{j=0}^{L} ε̂_{d+1+j} [e′Ĥ_{d+1+j,j}(θ̂_T)e] δ̂*_{d+1},

with

δ̂*_{d+1} = (φ̂_0^{(1)} − φ̂_0^{(2)}) + Σ_{s=1}^{p} (φ̂_s^{(1)} − φ̂_s^{(2)}) Y*_{d+1−s} + Σ_{s=1}^{q} (ψ̂_s^{(1)} − ψ̂_s^{(2)}) ε*_{d+1−s},

and

Y*_j = Ỹ_j (j ≥ 2), r̂_T (j = 1), Y_{i+j} (j ≤ 0);   ε*_j = ε̂_j (j ≥ 2), r̂_T − g(ẑ_i; θ̂_T) (j = 1), ε_{i+j} (j ≤ 0).

(v) Draw U from a random sample, with replacement, from the integers 1 to T − p + 1, using the vector of positive weights π̂(r̂_T|ẑ_i) / Σ_{i=k_0+1}^{K} π̂(r̂_T|ẑ_i) (i = k_0 + 1, . . . , K).
Figure 6.2: (a) Plot of the logistic transformed U.S. unemployment rate {Y_t}_{t=1}^{252}; (b) and (c) relative frequency histograms of T(r̂_i − r_{i,0}) (i = 1, 2) with r_{i,0} the true threshold value.
where T_i denotes the number of observations that belong to the ith regime, and σ̂²_{T_i} is the corresponding residual variance. The final SETAR model specification is given by

Y_t = −0.55_{(0.17)} + 1.69_{(0.12)} Y_{t−1} − 0.81_{(0.14)} Y_{t−2} + ε_t^{(1)}   if Y_{t−5} ≤ −3.14,
Y_t = 1.47_{(0.50)} + 2.16_{(0.17)} Y_{t−1} − 1.11_{(0.30)} Y_{t−2} − 0.38_{(0.27)} Y_{t−3} + 0.57_{(0.29)} Y_{t−4} + 0.25_{(0.27)} Y_{t−5} + ε_t^{(2)}   if −3.14 < Y_{t−5} ≤ −2.97,   (6.15)
Y_t = −0.05_{(0.05)} + 1.47_{(0.07)} Y_{t−1} − 0.45_{(0.14)} Y_{t−2} + 0.07_{(0.14)} Y_{t−3} − 0.28_{(0.13)} Y_{t−4} + 0.18_{(0.07)} Y_{t−5} + ε_t^{(3)}   if Y_{t−5} > −2.97,

where the sample variances of {ε_t^{(i)}} (i = 1, 2, 3) are 0.63 × 10⁻² (T_1 = 44), 0.19 × 10⁻² (T_2 = 34), and 0.17 × 10⁻² (T_3 = 172), and where the asymptotic standard errors of the parameter estimates are in parentheses. The coefficient estimates of φ_3^{(2)}, φ_5^{(2)}, φ_0^{(3)}, and φ_3^{(3)} are not statistically different from zero at the 5% nominal significance level. The p-values of the LB test statistic at lags 6, 12, and 18 are, respectively, 0.54, 0.17 and 0.08, which suggests that the fitted SETAR(3; 2, 5, 5) model is adequate.
To run the simulation approach, we need some additional specifications. In step (i) of Algorithm 6.3, we set N = 100 and estimate π(r_{i,0}) (i = 1, 2) by π̂(r_{i,0}) = T^{−1} Σ_{t=1}^{T} K_h(r_{i,0}; Y_t), where K_h(r_{i,0}; Y_t) = (√(2π) h)^{−1} exp{ −(r_{i,0} − Y_t)²/(2h²) } with h ≡ h_T > 0 the bandwidth from a Gaussian kernel density estimator.¹ The residual density is estimated by

f̂_ε(x) = (T − k_0)^{−1} Σ_{t=k_0+1}^{T} K_{h*_{opt}}(x; ε̂*_t).

Here, we use a Gaussian kernel with an improved bandwidth (see, e.g., Fan and Yao, 2003, p. 201)

h*_{opt} = h_{opt,T} { 1 + (35/48) κ + (35/32) τ + (385/1024) κ² }^{−1/5},
¹See Appendix 7.A for details on kernel estimation.
where ε_t^{(i)} = σ_i ε_t (i = 1, . . . , k), {ε_t} ~ i.i.d. (0, 1), and R^{(i)} = (r_{i−1}, r_i] with r_0 = −∞ and r_k = ∞. The delay d, the thresholds r_i, and the AR and MA lags in each regime are called structural parameters. They are collected together into the long vector

x* = ( d, r_1, . . . , r_{k−1}; {p_i; j_1^{(i)}, . . . , j_{p_i}^{(i)} | q_i; h_1^{(i)}, . . . , h_{q_i}^{(i)}}, i = 1, . . . , k )′.   (6.17)
to use an ARMA–LS estimation method due to Hannan and Rissanen (1982); see,
e.g., step (i) in Algorithm 6.3. Given a set of observations {Yt }Tt=1 , and assuming
x∗ is known, the CLS estimation procedure is as follows.
Algorithm 6.5: k-regime subset SETARMA–CLS estimation

(i) For each regime i, fit a high-order AR(n) (1 ≤ n ≤ n_max) model to the series using the Yule–Walker equations. Select n by AIC, and set n_max = (log T)^a (0 < a < ∞). Calculate {ε̂_t^{(i)}}_{t=n+1}^{T} (i = 1, . . . , k).

(ii) Set the maximum orders P and Q of respectively the AR and MA lags sufficiently large such that p_i ≤ p ≤ P ≤ n and q_i ≤ q ≤ Q.

(iv) Find the optimal structural parameter vector by minimizing the normalized AIC (NAIC) values, that is

NAIC(x*) = Σ_{i=1}^{k} { T_i log σ̂²_{T_i} + 2(p_i + q_i + 1) } / (effective sample size),

where T_i is the number of observations that belong to the ith regime, and σ̂²_{T_i} denotes the corresponding residual variance.

(v) Repeat steps (i) – (iv) for each d ∈ [1, d_max], with d_max a pre-specified integer.
(iii) Keep the best string intact for the next generation and create offspring strings
by three evolutionary operators:
(iv) Form the new population using the results of step (iii). If the search aim is
achieved, stop; else go to step (ii).
STAR models
Efficient estimation of STAR-type nonlinear models can be carried out by NLS or,
assuming the errors are normally distributed, by QML. Under certain regularity
conditions both methods will result in estimates that are consistent and asymptot-
ically normally distributed. Below we outline nonlinear CLS estimation of LSTAR
models, but the issues that are addressed also apply to ESTAR, time-varying STAR,
and multiple-regime STAR models.
Recall from Section 2.7 that for a stationary and ergodic time series process {Y_t, t ∈ Z} the LSTAR(2; p, p) model is defined by

Y_t = ( φ_0 + Σ_{i=1}^{p} φ_i Y_{t−i} ) + ( ξ_0 + Σ_{i=1}^{p} ξ_i Y_{t−i} ) G(Y_{t−d}; γ, c) + ε_t
    = φ′X_t + ξ′X_t G(Y_{t−d}; γ, c) + ε_t,   (6.19)

where φ = (φ_0, φ_1, . . . , φ_p)′, ξ = (ξ_0, ξ_1, . . . , ξ_p)′, and X_t = (1, Y_{t−1}, . . . , Y_{t−p})′, with {ε_t} ~ i.i.d. (0, 1), and G(·) is a logistic function defined by (2.43). Then, subject to some initial values, the problem is to minimize the ordinary least squares function

L_T(θ) = Σ_{t=1}^{T} { Y_t − φ′X_t − ξ′X_t G(Y_{t−d}; γ, c) }²   (6.20)

with respect to θ = (φ′, ξ′, γ, c)′. However, joint estimation of θ is not an easy task
in general and can result in large γ values. One reason is that γ is not scale invariant,
making it difficult to find a good starting value. To overcome this problem, and to
improve the stability and speed of the numerical optimization procedure, it is usually
preferred to estimate LSTAR models using the following transition function
G(Y_{t−d}; γ, c) = [ 1 + exp{ −γ(Y_{t−d} − c)/σ̂_Y² } ]^{−1},   γ > 0,   (6.21)

where σ̂_Y² is the sample variance of {Y_{t−d}}. Thus, the original slope parameter γ is transformed into a scale-free parameter.
Note that when the parameters γ and c are known and fixed, the LSTAR model
is linear in the AR parameters φ and ξ. Hence, assuming d and p are known, the
parameter vector τ = (φ′, ξ′)′ can be estimated by CLS as

τ̂(γ, c) = ( Σ_{t=1}^{T} X̃_t(γ, c) X̃_t′(γ, c) )^{−1} Σ_{t=1}^{T} X̃_t(γ, c) Y_t,   (6.22)

where X̃_t(γ, c) = (X_t′, X_t′ G(Y_{t−d}; γ, c))′. The corresponding concentrated sum of squares function is

L_T(γ, c) = Σ_{t=1}^{T} { Y_t − τ̂′(γ, c) X̃_t(γ, c) }².   (6.23)
So, minimization of (6.20) is only performed over γ and c, which helps to reduce the
computational burden considerably.
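A minimal sketch of this concentrated CLS procedure, using the scaled transition function as reconstructed in (6.21), is given below; the grids gammas and cs (for instance, sample percentiles of the transition variable for c) are user-supplied assumptions.

```python
import numpy as np

def lstar_concentrated_cls(y, p, d, gammas, cs):
    # Concentrated CLS (6.22)-(6.23): OLS for tau on a grid over (gamma, c),
    # with the variance-scaled logistic transition of (6.21).
    T = len(y)
    start = max(p, d)
    Y = y[start:]
    X = np.column_stack([np.ones(T - start)] +
                        [y[start - i:T - i] for i in range(1, p + 1)])
    s = y[start - d:T - d]                 # transition variable Y_{t-d}
    var_s = np.var(s)
    best = (np.inf, None)
    for g in gammas:
        for c in cs:
            G = 1.0 / (1.0 + np.exp(-g * (s - c) / var_s))
            Xt = np.hstack([X, X * G[:, None]])        # regressors of (6.22)
            tau, *_ = np.linalg.lstsq(Xt, Y, rcond=None)
            ssr = np.sum((Y - Xt @ tau) ** 2)          # L_T(gamma, c) of (6.23)
            if ssr < best[0]:
                best = (ssr, (g, c, tau))
    return best
```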
Regarding (6.23), some cautionary remarks are in order. It is apparent from Figure 2.9 that when the true slope parameter γ is relatively large, the slope of G(·) at c
is steep. In that case a meaningful set of grid values for the location parameter c is
needed (e.g., the sample percentiles of the transition variable Yt−d ) so that the value
of the transition function G(·) varies sufficiently across the whole sample, and the
optimization algorithm converges. Otherwise, the moment matrix of the regression
(6.22) is ill-conditioned and the estimation fails. It is also recommended to have a
large number of observations in the neighborhood of c to estimate γ accurately. If
there are not many data values near c, γ will be poorly estimated, and so convergence
may be slow. This situation may well result in a parameter estimate of γ which is
not statistically different from zero as judged by, for instance, a large standard error
and a small Student t-statistic. The calculated t-statistic, however, will not have an
exact Student t distribution under the null hypothesis γ = 0, since then the LSTAR
model is no longer identified; see Section 2.7. One implication is that in practice one
should focus upon the end use of the LSTAR model when attempting to evaluate it
and not necessarily on the parameter estimates.
ΔY_t = α_0 + β_0 Y_{t−1} + Σ_{i=1}^{p−1} ψ_{0i} ΔY_{t−i} + δ′D_t
       + ( α_1 + β_1 Y_{t−1} + Σ_{i=1}^{p−1} ψ_{1i} ΔY_{t−i} + δ′D_t ) G(Y_{t−d}; γ, c) + ε_t,   (6.24)

where ΔY_t ≡ Y_t − Y_{t−1} denotes the first-difference of the time series {Y_t}, D_t is a vector of monthly dummy variables, and δ the corresponding parameter vector.
When Yt−d = c, the adjustment process is given by the first term on the right-
hand side of (6.24), and as Yt−d → ±∞, the adjustment process is given by
(6.24) with G(·) = 1. Here, the crucial parameters are β_0 and β_1. Since large deviations are mean-reverting, β_1 < 0 and β_0 + β_1 < 0, while β_0 ≥ 0 is possible. A linear version of the regression in (6.24), called an error correction model, is given by

ΔY_t = α_0 + β_0 Y_{t−1} + Σ_{i=1}^{p−1} ψ_i ΔY_{t−i} + δ′D_t + ε_t.   (6.25)
Below we show estimation results for the series covering the time period Janu-
ary 1952 – December 1990 (T = 468). Later, in Chapter 10, we employ the
remaining part of the series for a rolling out-of-sample forecasting experiment.
Using a battery of time-domain nonlinearity tests, we obtain the following
best-fitting (in terms of minimum AIC) model for the series
where

G(Y_{t−1}; γ̂, ĉ) = [ 1 + exp{ (−1.95_{(0.83)}/0.82)(Y_{t−1} − (−0.77)_{(0.33)}) } ]^{−1},   (6.26)
Bilinear models
There are many methods for estimating coefficients of BL models. Among them is
the LS method, which is one of the most frequently applied. However, apart from
some simple BL models, the asymptotic properties of the LS estimates are unknown.
Figure 6.4: (a) Transition function (6.26) as a function of Yt−1 (blue dots), and an
estimate of the threshold value (red medium dashed line); (b) SST anomaly (blue solid line)
and transition function (6.26) (red dotted line) as a function of time.
In this section, we discuss a CLS approach, proposed by Grahn (1995) for a special case of (2.12), with known asymptotic properties. In particular, we want to estimate the BL model:
Y_t = φ_0 + Σ_{i=1}^{p} φ_i Y_{t−i} + ε_t + Σ_{j=1}^{q} ψ_j ε_{t−j} + Σ_{i=1}^{k} Σ_{j=w}^{r} τ_{ij} ε_{t−i} Y_{t−j},   (6.27)
d_j(s) ≡ τ_{sj} σ_ε² + Σ_{i=s+1}^{w−1+s} (ψ_i τ_{i−s,j−s} + ψ_{i−s} τ_{ij}) σ_ε²   and   h_{j,n}(s) ≡ Σ_{i=s+1}^{k} τ_{ij} τ_{i−s,n} σ_ε²,   (j = w, . . . , r + s; n = w, . . . , r),

and ψ_i ≡ 0 for i > q and τ_{ij} ≡ 0 for all i, j taking values outside the summation domain.
Thus, Cov(v_t, v_{t−s} | ε_{t−w}, ε_{t−w−1}, . . .) depends on the parameters and a finite set of observations {Y_t}_{t=1}^{T} only. As we will see in Algorithm 6.7, this property will be the basis for the proposed CLS estimation procedure.
Let β_0(s) be the true value of the parameter vector β(s) at lag s, i.e.

β(s) = ( γ_Y(s), d_w(s), . . . , d_{r+s}(s), h_{ww}(s), . . . , h_{wr}(s), . . . , h_{rw}(s), . . . , h_{rr}(s) )′.   (6.29)
Hence, in the second step, the aim is to find an estimator β(s) of β0 (s). Now,
summarizing the above results, the computation of CLS estimates goes as follows.
Σ_{t=(r+s)∨(p+1)}^{T} { v_t v_{t−s} − E(v_t v_{t−s} | ε_{t−w}, ε_{t−w−1}, . . .) }²   (6.30)

with respect to β(s) (s = 0, 1, . . . , w − 1), giving rise to β̂(s). It can be shown (Grahn, 1995) that β̂(s) → β_0(s) a.s., as T → ∞.
The function γ_Y(s) can be interpreted as the ACVF of this process. Therefore, γ_Y(s) = σ_ε² Σ_{j=0}^{q−s} ψ_j ψ_{j+s}. The equations which must be solved to obtain the MA parameters can be written, in two alternative ways, as
γ_Y = σ_ε² ( ψ_0       ψ_1   ···  ψ_{q−1}  ψ_q )
           ( ψ_1       ψ_2   ···  ψ_q      0   )
           (  ⋮         ⋮          ⋮       ⋮  )
           ( ψ_{q−1}   ψ_q   ···  0        0   )
           ( ψ_q       0     ···  0        0   ) ψ

    = σ_ε² ( ψ_0  ψ_1  ···  ψ_{q−1}  ψ_q     )
           ( 0    ψ_0  ···  ψ_{q−2}  ψ_{q−1} )
           ( ⋮    ⋮    ⋱    ⋮        ⋮       )
           ( 0    0    ···  ψ_0      ψ_1     )
           ( 0    0    ···  0        ψ_0     ) ψ,

i.e., γ_Y = σ_ε² A# ψ = σ_ε² A ψ, where A# is a (q + 1) × (q + 1) matrix with constant skew-diagonals, called a Hankel matrix, A is the corresponding upper-triangular Toeplitz matrix, γ_Y = ( γ_Y(0), γ_Y(1), . . . , γ_Y(q) )′, and ψ = (ψ_0, ψ_1, . . . , ψ_q)′.
Now, the objective is to solve

γ_Y − σ_ε² A# ψ = 0   (6.32)

for ψ. Since (6.32) is nonlinear in ψ, its solution must be found via an iterative procedure. For instance, we can use the Newton–Raphson algorithm (see, e.g., Wilson, 1969). In this case the (u + 1)th approximation, say ψ^{(u+1)}, to the final solution obtained from the uth approximation ψ^{(u)} (u ≥ 0) is given by

ψ^{(u+1)} = ψ^{(u)} − { σ_ε² (A#_u + A_u) }^{−1} ( σ_ε² A#_u ψ^{(u)} − γ_Y ),

which is equivalent to

σ_ε² (A#_u + A_u) ψ^{(u+1)} = γ_Y + σ_ε² A_u ψ^{(u)},

where the subscript u indicates that the elements are to be evaluated at ψ = ψ^{(u)}.
The equation for γ_Y(s) can be normalized either by setting σ_ε² = 1 or by setting ψ_0 = 1. In the first case, it is reasonable to choose ψ_0 = γ_Y(0)^{1/2} and ψ_1 = · · · = ψ_q = 0 as starting values of the iterative procedure. Once it has converged, the equation for γ_Y(s) can be re-normalized so that ψ_0 = 1.
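The Newton–Raphson iteration can be coded compactly. The sketch below normalizes σ_ε² = 1, starts from ψ_0 = γ_Y(0)^{1/2}, and rebuilds the Hankel matrix A# and the upper-triangular Toeplitz matrix A at each step; their sum is exactly the Jacobian of the defining equations.

```python
import numpy as np

def ma_from_acvf(gamma, max_iter=50, tol=1e-10):
    # gamma: array of length q+1 with gamma_Y(0), ..., gamma_Y(q); sigma_eps^2 = 1
    q = len(gamma) - 1
    psi = np.zeros(q + 1)
    psi[0] = np.sqrt(gamma[0])          # starting value psi_0 = gamma_Y(0)^{1/2}
    for _ in range(max_iter):
        Ah = np.zeros((q + 1, q + 1))   # Hankel: Ah[s, j] = psi[s + j] if s+j <= q
        At = np.zeros((q + 1, q + 1))   # Toeplitz: At[s, j] = psi[j - s] if j >= s
        for s in range(q + 1):
            for j in range(q + 1):
                if s + j <= q:
                    Ah[s, j] = psi[s + j]
                if j >= s:
                    At[s, j] = psi[j - s]
        f = Ah @ psi - gamma            # residual of gamma(s) = sum_j psi_j psi_{j+s}
        step = np.linalg.solve(Ah + At, f)
        psi -= step
        if np.max(np.abs(step)) < tol:
            break
    return psi

# usage: an MA(1) with psi = (1, 0.5) has gamma = (1.25, 0.5)
print(ma_from_acvf(np.array([1.25, 0.5])))   # approx. [1.0, 0.5]
```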
Below we present a procedure for identifying the BL parameters τ_{ij} from d_j(s) (j = w, . . . , r + s; s = 0, 1, . . . , w − 1). For simplicity, we assume that the equation for d_j(s) is normalized either by setting σ_ε² = 1 or by considering d_j(s)/σ_ε². Define the two (1/2)w(2r − w + 1) × 1 vectors τ and d, stacking respectively the coefficients τ_{ij} and the quantities d_j(s). Then the identification equations can be written compactly as

T τ = d,   (6.33)
where

T = ( D_0        U_{0,1}    ···  U_{0,w−2}    U_{0,w−1} )
    ( L_{1,0}    D_1        ···  U_{1,w−2}    U_{1,w−1} )
    (  ⋮          ⋮               ⋮            ⋮        )
    ( L_{w−1,0}  L_{w−1,1}  ···  L_{w−1,w−2}  D_{w−1}   ),
with D_i, U_{i,j}, and L_{i,j} sparse banded matrices of dimensions (h + i) × (h + i) (0 ≤ i ≤ w − 1), (h + i) × (h + j) (0 ≤ i < j ≤ w − 1), and (h + j) × (h + i) (0 ≤ j < i ≤ w − 1), respectively. The diagonal blocks D_i have unit diagonals with additional entries ψ_{2i} placed in column i + 1; the blocks U_{0,j} contain the entries 2ψ_j in their first column; and the off-diagonal blocks U_{i,j} and L_{i,j} contain the coefficients ψ_{j−i} and ψ_{i+j} in column i + 1, with zeros elsewhere.
In practice, (6.33) is replaced by its sample analogue

T̂ τ̂ = d̂.   (6.34)
(i) θ̂ → θ_0 a.s.

(ii) √T (θ̂ − θ_0) is asymptotically normally distributed with mean zero. Moreover, the law of the iterated logarithm holds, i.e. (θ̂ − θ_0) = O(S_T) a.s., with S_T = {T / log log T}^{−1/2}.
Figure 6.5: Boxplots and Q-Q plots of √T(τ̂ − τ) for τ = 0.3 (panels (a) and (c)), and τ = 0.5 (panels (b) and (d)); 1,000 MC replications.
Clearly, we use three estimators β1 (0), β2 (0), and β2 (1) to estimate two
unknown parameters (τ and σε2 ). Moreover, we neglect information contained
in the product τ 2 σε2 . Instead of coding this term as β11 β22 , it is only included as
the additional parameter β2 (0) in (6.37). These somewhat unfavorable features
of Algorithm 6.7 can be amended by trying to minimize the conditional sum
of squares
Σ_{t=3}^{T} { Y_t² − (θ_2 + θ_1² θ_2 Y_{t−2}²) }² + Σ_{t=3}^{T} { Y_t Y_{t−1} − θ_1 θ_2 Y_{t−2} }².
Figure 6.6: Boxplots of √T(σ̂_ε² − σ_ε²) for (a) τ = 0.3, and (b) τ = 0.5; 1,000 MC replications.
Clearly, for increasing values of |τ| the nonlinearity of the generated time series becomes more prominent, and as a consequence CLS estimation becomes more difficult. Still, for all values of T, the boxplots in Figure 6.5 look almost symmetric and most of them can be interpreted as being sampled from a Gaussian distribution. The Q-Q plots confirm this observation. However, all distributions tend to have negative medians as well as negative means. This tendency diminishes with increasing values of T and is due to the interaction between values of τ̂ and values of σ̂_ε². From Figure 6.6 we see that σ̂_ε² overestimates the parameter σ_ε², and this phenomenon is more pronounced as τ increases from 0.3 to 0.5. According to its definition β̂_1(0) is a positive quantity, but β̂_2(1) can be either positive or negative. If β̂_2(1) > 0, overestimating σ_ε² will imply that τ̂ < τ. On the other hand, if β̂_2(1) ≤ 0, then τ̂ ≤ 0. Hence, in both cases, overestimating σ_ε² results in underestimation of the parameter τ.
General formulation
Let θ be an m-dimensional parameter vector of interest. Assume that the actual value θ_0 generating y, a T × 1 random vector of observations with corresponding density function f(y; θ), belongs to an open parameter space Θ ⊆ R^m. The ML estimate θ̂ of θ_0 follows from solving
(i) Fisher's information matrix is given by ∂g(θ, θ̃)/∂θ̃ evaluated at θ̃ = θ̂.

(ii) If θ^{(0)} is a given starting value, and we define in the (u + 1)th iteration θ^{(u+1)} (u ≥ 0) as a root of the equation g(θ, θ^{(u)}) = G(y, θ^{(u)}), then θ^{(u)} → θ̂ as u → ∞. Furthermore, it can be shown that |θ^{(u)} − θ̂| = O_p(T^{−u/2}).

Hence,

θ̂ ≈ θ + [ ∂g(θ, θ̃)/∂θ̃ |_{θ̃=θ} ]^{−1} G(y, θ),   (6.41)
Y_t = φ_0^{(1)} + Σ_{i=1}^{p_1} φ_i^{(1)} Y_{t−i} + ε_t   if Y_{t−d} ≤ r,
Y_t = φ_0^{(2)} + Σ_{i=1}^{p_2} φ_i^{(2)} Y_{t−i} + ε_t   if Y_{t−d} > r,   (6.42)

σ_t² = α_0^{(1)} + Σ_{i=1}^{q_1} α_i^{(1)} ε²_{t−i}   if Y_{t−d} ≤ r,
σ_t² = α_0^{(2)} + Σ_{i=1}^{q_2} α_i^{(2)} ε²_{t−i}   if Y_{t−d} > r,   (6.43)
where ε_t | F_{t−1} ~ N(0, σ_t²) with F_{t−1} = {Y_{t−1}, Y_{t−2}, . . .} the available information set at time t − 1. The conditional mean and conditional variance of {Y_t, t ∈ Z} are given by

μ_t = Σ_{i=1}^{2} ( φ_0^{(i)} + Σ_{j=1}^{p_i} φ_j^{(i)} Y_{t−j} ) I_t^{(i)},   σ_t² = Σ_{i=1}^{2} ( α_0^{(i)} + Σ_{j=1}^{q_i} α_j^{(i)} ε²_{t−j} ) I_t^{(i)},

where I_t^{(1)} = I(Y_{t−d} ≤ r) and I_t^{(2)} = I(Y_{t−d} > r), and θ = (φ^{(1)′}, α^{(1)′}, φ^{(2)′}, α^{(2)′}, r)′, with

Q_T(θ) = −(1/2) Σ_{t=1}^{T} Σ_{i=1}^{2} ( log σ_t² + ε_t²/σ_t² ) I_t^{(i)},
Σ_{t=1}^{T} Z_t W_t Z_t′ θ^{(u+1)}(r) = Σ_{t=1}^{T} Z_t W_t Z_t′ θ^{(u)}(r) + Σ_{t=1}^{T} Z_t W_t X_t,   (6.44)

where

Z_t = ( ∂σ_t²/∂θ, ∂μ_t/∂θ ),   W_t = diag( 1/(2σ_t⁴), 1/σ_t² ),   X_t = ( (Y_t − μ_t)² − σ_t², Y_t − μ_t )′.
Figure 6.7: Time plots of (a) the daily closing prices, and (b) the log-returns for the Hong
Kong Hang Seng Index (HSI) for the year 2010.
with

σ_t² = 1.29 + 0.02 ε²_{t−1}   if Y_{t−1} ≤ 0.16,
σ_t² = 0.91 + 0.73 ε²_{t−1}   if 0.16 < Y_{t−1} ≤ 1.03,   (6.47)
σ_t² = 0.24 + 0.02 ε²_{t−1} + 0.07 ε²_{t−2} + 0.13 ε²_{t−3}   if Y_{t−1} > 1.03,

where ε_t^{(i)} = σ_t ε_t (i = 1, 2, 3) and {ε_t} ~ i.i.d. N(0, 1). The sample variances of {ε_t^{(i)}} are 1.31 (T_1 = 138), 1.18 (T_2 = 58), and 57 (T_3 = 49), respectively. The sample variances of the volatility equation are 3.41, 1.87, and 76.3, respectively.
The most important feature is clearly the difference in the behavior of the series
in each regime. When Yt−1 is between 0.16 and 1.03 the behavior is slower in
adjusting to shocks than in the third regime. In the first regime the series {Pt }
closely approximates a random walk process with a drift term. The behavior
of the conditional variance also varies considerably between regimes; shocks to
the conditional variance are more persistent in the second and third regime,
and weakly persistent in the first regime. Observe that all estimated coefficients in σ_t² are nonnegative. Negative coefficients would be counter-intuitive in (6.43), which implies that the IWLS algorithm needs to be constrained.
The equality in (6.49) and (6.50) arises if and only if fm (·; θm )/f (·; θ0,m ) is degenerate
at E0 {fm (·; θm )/f (·; θ0,m )} (= 1), in other words if and only if fm (·; θm )= f (·; θ0,m )
a.e. In particular, the equality in (6.49) and (6.50) holds when θm = θ0 .
The application of Jensen’s inequality clarifies that I KL (·) is determined by the
dispersion of fm (·; θm )/f (·; θ0,m ), and this explains why I KL (·) can serve as a meas-
ure of the divergence between the density function fm (·; θm ) and the true density
where

H_m(y) = lim_{T→∞} (1/T) ∂² log f_m(y; θ_{0,m}) / ∂θ∂θ′  a.s.,   I_m(y) = lim_{T→∞} (1/T) Var( ∂ log f_m(y; θ_{0,m}) / ∂θ ).

Hence, the third term on the right-hand side of (6.52) becomes

E_y{ √T(θ̂_{T,m} − θ_{0,m})′ [ (1/T) ∂² log f_m(y; θ)/∂θ∂θ′ |_{θ=θ_{0,m}} ] √T(θ̂_{T,m} − θ_{0,m}) }.

Recall that y and x have the same pdf (which implies that H_m(y) = H_m(x)) and that they are independent of each other. Consider the term 2E_y E_x{log f_m(x; θ̂_{T,m})} in (6.52). Assuming that E_x(·) is sufficiently smooth, and its derivatives under the expectation sign exist, a second-order Taylor expansion of 2E_x{log f_m(x; θ̂_{T,m})} around θ_{0,m} yields

2E_x{log f_m(x; θ̂_{T,m})} = 2E_x{log f_m(x; θ_{0,m})} + T(θ̂_{T,m} − θ_{0,m})′ H_m(y) (θ̂_{T,m} − θ_{0,m}) + o_p(1).   (6.56)

2E_y E_x{log f_m(x; θ̂_{T,m})} = 2E_x{log f_m(x; θ_{0,m})} + tr{ I_m(y) H_m^{−1}(y) }.   (6.57)
where the acronym AIC stands for Akaike information criterion . Clearly, this model
selection criterion establishes a certain balance between the model-size pm and the
lack-of-fit measured by −2 log fm (y; θT,m ). In other words, it is beneficial to simplify
the model, by leaving out the less important aspects, as long as the reduction in
model-size outweighs the deterioration of the fit.
The performance of the AIC rule can be judged in different ways. One reasonable
scenario is to assume that the approximating parametric family of models Mm
includes the DGP. This is a strong assumption, but it is also used in the derivation
of AIC. Then it can be shown (see, e.g., McQuarrie and Tsai, 1998) that, under quite
general conditions, the AIC rule is inconsistent and the asymptotic probability of
overfitting is not insignificant, as T → ∞. A more practical scenario is to assume
that the DGP is more complex than any of the candidate models. In such a case the
selected model can be viewed as an approximation of the DGP, and we can consider,
for instance, the model’s average prediction error as a performance measure of the
AIC rule.
AICc rule
Hurvich and Tsai (1989) obtain an approximation of (6.58) for univariate linear
regression and AR time series models that reduces the small sample bias of the AIC
rule. This so-called corrected AIC (AICc ) is given by
AICc(m) = −2 log f_m(y; θ̂_{T,m}) + 2T p_m / (T − p_m − 1).   (6.60)
Due to the second term in (6.60), AICc has a smaller risk of overfitting than AIC
for finite values of T . With this fact in mind, and being pragmatic rather than
theoretical, AICc can be used as an order selection criterion for more general linear
and nonlinear time series models.
AICu rule
McQuarrie et al. (1997) introduce an alternative criterion for linear regression time
series models which is an approximate unbiased (u) estimate of the KL information
I(m) defined in (6.51). This criterion, denoted by AICu, is given by

AICu(m) = −2 log f_m(y; θ̂_{T,m}) + 2T p_m/(T − p_m − 1) + 2T log{ T/(T − p_m) }.   (6.61)
However, AICu is neither a consistent nor an asymptotically efficient criterion. Nevertheless, it performs well in finite samples, and hence can be adopted for more general models than just linear regressions.
In practice, the term on the right-hand side of (6.62) can be replaced by an unbiased estimator. The resulting criterion, called the generalized information criterion (GIC), is given by

GIC(m) = −2 log f_m(y; θ̂_{T,m}) + (ν + 1) p_m.   (6.63)

Clearly, when ν = 1, GIC reduces to AIC. Extensive simulation studies (see, e.g., Bhansali and Downham, 1977) have shown empirically that for ν ∈ [2, 5] the correct order is found more frequently than with AIC. The Bayesian approach of the next section provides an explicit expression for the term (ν + 1).
where

f(y|M_m) = ∫ f(y|θ_m, M_m) f(θ_m|M_m) dθ_m,

and where the symbol ∝ denotes proportionality. Assuming the same prior probability for all models, Schwarz (1978) derives the following large sample approximation

log f(y|M_m) ≈ log f_m(y; θ̂_{T,m}) − (p_m/2) log T.   (6.64)
Hence, maximizing (6.64) is equivalent to minimizing the Bayesian information criterion (BIC):

BIC(m) = −2 log f_m(y; θ̂_{T,m}) + p_m log T,   (6.65)

independently of the chosen prior. It is an interesting fact that the BIC rule can
also be derived within the KL framework. Moreover, it can be shown (see, e.g.,
McQuarrie and Tsai, 1998) that the BIC rule is consistent, that is the probability
of correct detection approaches one as T → ∞.
All five order selection criteria AIC, AICc, AICu, BIC and GIC have a common form, that is, they are members of the family of criteria

min_{θ_m∈Θ} { −2 log f_m(y; θ̂_{T,m}) + p_m C(T, p_m) },   (6.66)
but with a different penalty function C(T, p_m). Figure 6.8 shows the behavior of C(T, p_m) as a function of T for each selection rule.

Figure 6.8: Penalty functions C(T, p_m) of AIC (pink solid line), AICc with p_m = 5 (blue long dashed line), AICu with p_m = 5 (red dotted line), BIC (green short dashed line), and GIC with ν = 3 (cyan medium dashed line).
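In code, the five criteria can be computed directly from the maximized log-likelihood. The AICu and GIC forms below follow (6.60), (6.61) and (6.63) as reconstructed above, so they should be read with the same caveats.

```python
import numpy as np

def info_criteria(loglik, T, p, nu=3):
    # Members of the family (6.66): -2*loglik + p*C(T, p).
    # loglik: maximized log-likelihood; T: sample size; p: number of parameters.
    return {
        "AIC":  -2 * loglik + 2 * p,
        "AICc": -2 * loglik + 2 * T * p / (T - p - 1),
        "AICu": -2 * loglik + 2 * T * p / (T - p - 1) + 2 * T * np.log(T / (T - p)),
        "BIC":  -2 * loglik + p * np.log(T),
        "GIC":  -2 * loglik + (nu + 1) * p,
    }
```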
Given the above model selection criteria, an obvious question is: Which criterion
to use in practice? Unfortunately, within the context of nonlinear time series this
question has been the subject of only a few papers (cf. Section 6.2.6). Overall,
AICc outperforms AIC and BIC in small samples. BIC penalizes models which are
over-parameterized and so gives some value to parsimony. For this reason one may
prefer BIC over other criteria. On the other hand, if parsimony is not considered
to be really important, one may use a criterion which picks up any subtle nuance
in the data and as a result the fitted nonlinear model will be inclined to overfit in
sample. In fact, we recommend that any model should be evaluated in terms of its
out-of-sample forecasting ability, and compared with forecasts from linear and other
nonlinear time series models.
on a functional form, e.g. a conditional mean function g(·; θg ), BIC does not take
this extra complexity into account, while in MDL, this extra bit of uncertainty is
reflected in I(·). For parametric models an estimator of I(·) is given by (6.4). The
integration in the last term of (6.67) can be well approximated by MC simulation
methods (see, e.g., Robert and Casella, 2004).
SC(p_1, . . . , p_k) = min_{p_1,...,p_k} Σ_{i=1}^{k} { T_i log σ̂²_{T_i} + (p_i + 1) C(T_i, p_i + 1) },   (6.68)

where T_i (i = 1, . . . , k) denotes the number of observations in each regime, σ̂²_{T_i} the corresponding (conditional) residual variance, and with penalty function

C(T_i, p_i + 1) = 2   for AIC,
C(T_i, p_i + 1) = {1/(p_i + 1)} T_i(T_i + p_i + 1)/(T_i − (p_i + 1) − 2)   for AICc,
C(T_i, p_i + 1) = {1/(p_i + 1)} [ T_i(T_i + p_i + 1)/(T_i − (p_i + 1) − 2) + T_i log{ T_i/(T_i − (p_i + 1) − 1) } ]   for AICu,
C(T_i, p_i + 1) = log T_i   for BIC.
(ii) Assume r ∈ [r, r̄] ⊂ R with r the 0.25 × 100% percentile and r̄ the 0.75 × 100% percentile of {Y_t}_{t=1}^{T}.

(iii) Let {Y_{(j)}(d)}_{j=1}^{T} denote the order statistics of {Y_t}_{t=1}^{T} for a fixed d ∈ [1, d_max]. Let I_r = {[0.25T], [0.25T] + 1, . . . , [0.75T]}. Set r = Y_{(j)}(d).

(iv) Calculate min_{1≤k_1≤p*_1, 1≤k_2≤p*_2} SC(k_1, k_2). Let SC(Y_{(j)}(d)) be the minimum. Denote the corresponding model orders giving this minimum as k*_i(Y_{(j)}(d)) (i = 1, 2). Note that in the calculation the first max(d, p*_1, p*_2) observations should be discarded to make the comparison meaningful.

(v) Calculate min_{j∈I_r} SC(Y_{(j)}(d)), and denote the value of Y_{(j)}(d) giving this minimum as Y*_{(j)}(d).

(vi) Calculate min_{1≤d≤d_max} SC(Y*_{(j)}(d)), and denote the value of d giving this minimum as d̂.

(vii) The selected delay parameter is d̂, and the estimate of the threshold parameter is r̂ = Y*_{(j)}(d̂).
The second set of order selection criteria is based on the concept of CV. This comes down to dividing the available data set into two subsets: a calibration set for estimating a model, and a validation set for evaluating its performance, as we briefly explained in Section 6.2.3. In principle these subsets may contain different numbers of observations. Within the context of SETAR(2; p_1, p_2) model selection, however, we focus on the so-called leave-one-out CV-criterion. In that case the order selection procedure goes as follows.
Algorithm 6.9: Leave-one-out CV order selection

(i) Follow steps (i) – (iii) of Algorithm 6.8.

(ii) Omit one observation from the available data set {Y_t}_{t=1}^{T}, and with the remaining data set obtain the CLS estimates of a SETAR model, using Algorithm 6.2. Let r̂^{(t)} be the corresponding estimate of r, and φ̂^{(t)}_{T−1,i} an estimate of φ^{(i)} = (φ_0^{(i)}, . . . , φ_{p_i}^{(i)})′ (i = 1, 2).

(iii) Predict the omitted observation and obtain the predictive residual ε̂_t(φ̂^{(t)}_{T−1,i}, r̂^{(t)}).

(v) The final model is the one which minimizes the MSFE over all SETAR models:

min_{p_1,p_2} C(p_1, p_2) = Σ_{t=s}^{T} Σ_{i=1}^{2} ε̂_t²(φ̂^{(t)}_{T−1,i}, r̂^{(t)}),   (6.69)
Under fairly weak conditions it can be proved (Stoica et al., 1986) that for linear time series regressions T log{T^{−1} C(·)} = AIC(·) + O(T^{−1/2}). Using this relationship, De Gooijer (2001) proposes the following CV model selection criteria for SETAR(k; p, . . . , p) models

C_c = T log( Σ_{t=s}^{T} Σ_{i=1}^{k} ε̂_t²(φ̂^{(t)}_{T−1,i}, r̂^{(t)}) ) + Σ_{i=1}^{k} T_i(T_i + p_i + 1)/(T_i − (p_i + 1) − 2),   (6.70)

C_u = T log( Σ_{t=s}^{T} Σ_{i=1}^{k} ε̂_t²(φ̂^{(t)}_{T−1,i}, r̂^{(t)}) ) + Σ_{i=1}^{k} [ T_i(T_i + p_i + 1)/(T_i − (p_i + 1) − 2) + T_i log{ T_i/(T_i − (p_i + 1) − 1) } ].   (6.71)
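Assuming the forms of (6.70) and (6.71) as reconstructed above, the criteria are simple to evaluate once the leave-one-out predictive residuals are available; the helper below is a hypothetical illustration.

```python
import numpy as np

def cv_criteria(pred_resid_sq, T_regime, p_regime):
    # Cc and Cu of (6.70)-(6.71) from leave-one-out squared predictive residuals.
    # pred_resid_sq: squared predictive residuals pooled over all t and regimes;
    # T_regime, p_regime: per-regime sample sizes T_i and AR orders p_i.
    T = sum(T_regime)
    base = T * np.log(np.sum(pred_resid_sq))
    cc = base + sum(Ti * (Ti + pi + 1) / (Ti - (pi + 1) - 2)
                    for Ti, pi in zip(T_regime, p_regime))
    cu = base + sum(Ti * (Ti + pi + 1) / (Ti - (pi + 1) - 2)
                    + Ti * np.log(Ti / (Ti - (pi + 1) - 1))
                    for Ti, pi in zip(T_regime, p_regime))
    return cc, cu
```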
De Gooijer (2001) and Galeano and Peña (2007) compare by simulation the
performance of various CV- and AIC-type (including BIC) criteria for two-regime
SETAR model selection in case both d and r are unknown. Their results indicate
that AICu and Cu have larger frequencies in detecting the true AR orders and delay
parameters than AIC, AICc , and BIC, when the sample size is small to moderate
(T ∈ [30, 75]). Since AICu and C_u will tend to select a more parsimonious two-regime SETAR model than AIC, we recommend using both criteria rather than AIC for relatively small samples. The extra computing time C_u needs, as opposed
to the time it takes to estimate a “conventional” criterion like AIC, is negligible
for T ≤ 75. Otherwise, i.e., in situations with T ≥ 100, the improvement of the
modified criteria over AIC diminishes.
Example 6.7: U.S. Unemployment Rate (Cont’d)
It is interesting to compare the performance of the above model selection
criteria using the transformed quarterly U.S. unemployment rate series {Yt }252
t=1
plotted in Figure 6.2(a). For two-regime SETAR models, we set the maximum allowable orders p_{1,max} = p_{2,max} = 10. For three-regime SETAR models, we take p_{1,max} = p_{2,max} = p_{3,max} = 6. In both cases, we prefix the maximum value of the delay at d_max = 10. Parameter estimates are based on CLS.
Candidate threshold values are searched between the 25th and 75th percentiles
of the empirical distribution of {Yt }.
Table 6.1 contains the orders of the selected SETAR models, jointly with se-
lected values of d and estimates of the threshold parameters. We see that AIC
prefers a model with relatively high AR orders in each regime while almost all
other criteria tend to select a more parsimonious model. Of course, the pref-
erence for a less parsimonious or a parsimonious criterion largely depends on
how one weighs these overfitting or underfitting tendencies in a given empirical
situation. Note, that AICu and BIC favor a SETAR(2; 2, 2) model with delay
d = 5 while CVc and CVu choose the same model with d = 10. Also, in the
case of selecting a three-regime SETAR model, there is hardly any difference
between the orders selected by AIC c , AICu , BIC, CV, and CVc . One inter-
esting situation occurs with CV_u with all orders equal to one and d = 1. Clearly, the estimated threshold parameter values are quite near to each other, suggesting that a two-regime rather than a three-regime SETAR model is more appropriate in this case.

Table 6.1: SETAR orders selected for the transformed quarterly U.S. unemployment rate.
U(ε_t) = ( u*_1(ε_t), . . . , u*_P(ε_t) )′ and V(ε_t) = ( v*_1(ε_t), . . . , v*_Q(ε_t) )′.
Naturally, given {Y_t}_{t=1}^{T}, we replace the above quantities by their corresponding sample statistics with θ̂_T the QML or CLS estimator of θ. Denote the estimated Pearson residuals by ε̂_t ≡ ε̂_t(θ̂_T) = (Y_t − ĝ_t)/ĥ_t^{1/2}, in which ĝ_t ≡ g(Y_{t−1}; θ̂_T) and ĥ_t ≡ h(Y_{t−1}; θ̂_T). Let μ̂_{u_i} and μ̂_{v_j} (σ̂²_{u_i} and σ̂²_{v_j}) be, respectively, the sample means (variances) of u_i(·) and v_j(·). Moreover, let ∇_θ ĝ_t and ∇_θ ĥ_t be, respectively, the column vectors of partial derivatives of ĝ_t and ĥ_t with respect to θ. Denote w_t = (∇_θ g_t) h_t^{−1/2}, z_t = (∇_θ h_t) h_t^{−1}, ŵ_t = w_t|_{θ=θ̂_T}, ẑ_t = z_t|_{θ=θ̂_T}, û*_i(ε̂_t) = (u_i(ε̂_t) − μ̂_{u_i})/σ̂_{u_i}, and v̂*_j(ε̂_t) = (v_j(ε̂_t) − μ̂_{v_j})/σ̂_{v_j}. The lag ℓ sample cross-correlation of u_i(ε̂_t) and v_j(ε̂_{t−ℓ}) is given by ρ̂_ε^{(i,j)}(ℓ) = (T − ℓ)^{−1} Σ_{t=ℓ+1}^{T} û*_i(ε̂_t) v̂*_j(ε̂_{t−ℓ}), and the sample analogue of ρ(ℓ) is ρ̂(ℓ) = ( ρ̂_ε^{(1,1)}(ℓ), . . . , ρ̂_ε^{(1,Q)}(ℓ), . . . , ρ̂_ε^{(P,1)}(ℓ), . . . , ρ̂_ε^{(P,Q)}(ℓ) )′. Finally, to describe the asymptotic behavior of a finite set of ρ̂(ℓ) vectors, we define a PQM × 1 (M ≪ T) vector Π̂(M) = ( ρ̂′(1), . . . , ρ̂′(M) )′.
. . . , ρ(M
Under H0 , and certain regularity conditions, it can be shown (Chen, 2008) that
√ 1 T
=√
T − k ρ() Ψ(εt , εt− ) + op (1),
T − t=k+1
where
1
Ψ(εt , εt− ) = U(εt ) ⊗ V(εt− ) − Λ()Υ−1 [wt εt + zt (ε2t − 1)],
2
1
Υ = E[wt wt ] + E[zt zt ],
2
238 6 MODEL ESTIMATION, SELECTION, AND CHECKING
and
1
Λ() = E[∇U(εt )] ⊗ E[V(εt− )wt ] + E[∇U(εt )] ⊗ E[V(εt− )zt ],
2
Cov( Σ_{t=ℓ+1}^{T} Ψ(ε_t, ε_{t−ℓ}), Σ_{t=ℓ′+1}^{T} Ψ(ε_t, ε_{t−ℓ′}) ) = (T − ℓ)[ δ_{ℓ,ℓ′} I_{PQ} + A(ℓ, ℓ′) ],   (6.76)

where
From the proof of this last result it can be deduced that {Ψ(ε_t, ε_{t−ℓ})} is a sequence of uncorrelated elements. Then it follows that the asymptotic null distribution is given by

√(T − ℓ) ρ̂(ℓ) →_D N_{PQ}( 0, Σ(ℓ) ),   Σ(ℓ) = I_{PQ} + A(ℓ, ℓ),   (6.77)

√T Π̂(M) →_D N_{PQM}( 0, Ξ(M) ),   Ξ(M) = I_{PQM} + B(M),   (6.78)

C_T(ℓ) = (T − ℓ) ρ̂′(ℓ) Σ̂_T^{−1}(ℓ) ρ̂(ℓ),   (6.79)

Q_T(M) = T Π̂′(M) Ξ̂_T^{−1}(M) Π̂(M),   (6.80)

where Σ̂_T(ℓ) and Ξ̂_T(M) are consistent estimates of Σ(ℓ) and Ξ(M), respectively. Under H0, and as T → ∞, it follows that for any fixed ℓ, C_T(ℓ) →_D χ²_{PQ}, and for any fixed M, Q_T(M) →_D χ²_{PQM}.
Table 6.2: Values of the test statistics C_T^{(i,j)}(ℓ) (ℓ = 1, 3, 5) and Q_T^{(i,j)}(5) for three fitted models.(1)

Model                         (i, j)   ℓ = 1   ℓ = 3   ℓ = 5   M = 5
SETAR(2; 1, 1)                (1, 1)   0.56    0.31    0.14    2.26
                              (1, 2)   0.17    0.58    0.00    3.68
                              (2, 1)   2.00    3.21    0.60    6.51
                              (2, 2)   0.52    0.59    2.16    4.05
SETAR(2; 1, 1)–GARCH(1, 1)    (1, 1)   0.07    0.52    0.26    1.78
                              (1, 2)   0.00    0.14    0.06    1.89
                              (2, 1)   2.07    2.32    0.41    6.40
                              (2, 2)   4.68    0.03    0.76    7.59
SETAR(2; 1, 1)–EGARCH(1, 1)   (1, 1)   0.14    0.63    0.30    2.03
                              (1, 2)   0.02    0.60    0.19    2.62
                              (2, 1)   0.83    1.03    0.07    4.10
                              (2, 2)   3.67    0.03    0.61    7.36

(1) The 95% critical values of the χ²_1, χ²_3, χ²_5, χ²_10, and χ²_20 distributions are approximately 3.84, 7.81, 11.07, 18.31, and 31.41.
We note that under H0, the asymptotic variance of √(T − ℓ) ρ̂(ℓ) is exactly the same as the variance of Ψ(ε_t, ε_{t−ℓ}), so that we have a simple estimate of Σ(ℓ), i.e.

Σ̂_T(ℓ) = (1/(T − ℓ)) Σ_{t=ℓ+1}^{T} Ψ̂_t(ℓ) Ψ̂_t′(ℓ),   (6.81)

Ξ̂_T(M) = (1/(T − M)) Σ_{t=M+1}^{T} ( Ψ̂_t′(1), . . . , Ψ̂_t′(M) )′ ( Ψ̂_t′(1), . . . , Ψ̂_t′(M) ).   (6.82)
test, we consider the class of power-transformation-based correlations ρ̂_ε^{(i,j)}(ℓ). Replacing ρ_ε^{(i,j)}(ℓ) by ρ̂_ε^{(i,j)}(ℓ), Table 6.2 shows values of the test statistics C_T^{(i,j)}(ℓ) for ℓ = 1, 3, and 5 and Q_T^{(i,j)}(5) (i, j = 1, 2). Except for C_T^{(2,2)}(1) in the case of a SETAR(2; 1, 1)–GARCH(1, 1) model, none of the reported values are significant at the 5% nominal level; hence, we conclude that the standardized residuals are serially uncorrelated. This suggests that a simple SETAR model is capable of describing the DGP. The fit of a more complicated model, as in Example 6.6, does not seem to be needed.
f(y; θ_m) = Π_{t=1}^{T} f_{t−1}(Y_t; θ_m),   (6.84)

where f_{t−1}(Y_t; θ_m) ≡ f(Y_t; θ_m|F_{t−1}) is the conditional density function of {Y_t, t ∈ Z} given F_{t−1} = σ(Y_0, Y_1, . . . , Y_{t−1}), the σ-algebra generated by the random variables {Y_0, Y_1, . . . , Y_{t−1}}, θ_m ⊂ R^m an m-dimensional parameter vector, and where Y_0 represents the initial model values. Then, according to Dunn and Smyth (1996), the theoretical quantile residual is defined by

R_{t,θ_m} = Φ^{−1}( F_{t−1}(Y_t; θ_m) ),   (6.85)

where Φ^{−1}(·) is the inverse CDF of the N(0, 1) distribution, and F_{t−1}(Y_t; θ_m) = ∫_{−∞}^{Y_t} f_{t−1}(u; θ_m) du is the conditional CDF of {Y_t, t ∈ Z}, also called the probability integral transform.³ The sample counterpart is

r_{t,θ̂_T} = Φ^{−1}( F_{t−1}(Y_t; θ̂_T) ),   (6.86)

where θ̂_T (dropping the subscript m) is a QML estimate of θ_{0,m}. Observe that quantile residuals of linear and nonlinear AR models with normal errors are identical to Pearson residuals.

³This is, for instance, the case with the mixture AR (MAR) model (see Exercise 7.7), and the MAR–GARCH model (Wong and Li, 2000b, 2001).
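For models with conditional Gaussian one-step densities, (6.86) therefore reduces to the Pearson residuals, as just noted. A minimal sketch:

```python
import numpy as np
from scipy import stats

def quantile_residuals(y, cond_mean, cond_sd):
    # Quantile residuals (6.86) for a model with conditional N(mu_t, sigma_t^2)
    # one-step densities: r_t = Phi^{-1}( F_{t-1}(Y_t) ).
    u = stats.norm.cdf((y - cond_mean) / cond_sd)     # probability integral transform
    u = np.clip(u, 1e-12, 1 - 1e-12)                  # guard against exact 0 or 1
    return stats.norm.ppf(u)                          # equals Pearson residuals here
```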
and d and n are the dimensions of the domain and range of g. Different choices of
g lead to different test statistics.
Conditional on a vector with initial values Y_0, and assuming that the conditional density functions f_{t−1}(Y_t; θ_m) exist, the log-likelihood function ℓ_T(y, θ) = Σ_{t=1}^{T} ℓ_t(Y_t, θ) = Σ_{t=1}^{T} log f_{t−1}(Y_t; θ) of the sample follows directly. Then, under some fairly standard regularity conditions, Kalliovirta (2012) proves the following CLT

(1/√T) Σ_{t=1}^{T} g(R_{t,θ̂_T}) →_D N_d(0, Ω),   (6.87)

where

Ω = G I(θ_0)^{−1} G′ + Ψ I(θ_0)^{−1} G′ + G I(θ_0)^{−1} Ψ′ + H,   (6.88)
where Ĝ_T = T^{−1} Σ_{t=1}^{T} ∂g(r_{t,θ̂_T})/∂θ′, Ψ̂_T = T^{−1} Σ_{t=1}^{T} g(r_{t,θ̂_T}) ∂ℓ_t(Y_t, θ̂_T)/∂θ′, and Ĥ_T = T^{−1} Σ_{t=1}^{T} g(r_{t,θ̂_T}) g(r_{t,θ̂_T})′. Based on (6.87), a general test statistic is defined as

S_{T,d} = (1/(T − d + 1)) ( Σ_{t=1}^{T−d+1} g(r_{t,θ̂_T}) )′ Ω̂_T^{−1} ( Σ_{t=1}^{T−d+1} g(r_{t,θ̂_T}) ),   (6.90)
Table 6.3: Three diagnostic test statistics based on univariate quantile residuals, as special
cases of the general test statistic ST,d .
where r_{t,θ̂_T} = (r_{t,θ̂_T}, . . . , r_{t−d+1,θ̂_T})′.⁴ Under H0, and as T → ∞, (6.90) has an asymptotic χ²_n distribution; Kalliovirta (2012).
Table 6.3 shows three diagnostic test statistics, as special cases of (6.90). Note that the test statistic for residual autocorrelation is based on the uncentered sample autocovariances (T − ℓ)^{−1} Σ_{t=1}^{T−ℓ} r_{t,θ̂_T} r_{t+ℓ,θ̂_T}. The test statistic for conditional heteroskedasticity is based on the sample autocovariances (T − ℓ)^{−1} Σ_{t=1}^{T−ℓ} (r²_{t,θ̂_T} − 1) r²_{t+ℓ,θ̂_T}, while the normality test statistic builds on ideas suggested by Lomnicki (1961); see, e.g., Section 1.3.1. Under H0 these test statistics are asymptotically distributed as respectively χ²_{K_1}, χ²_{K_2}, and χ²_3.
Figure 6.9: Schematic view of a water table relative to the ground surface elevation, called "water table depth" (denoted by Y_t), with as input variable "precipitation excess" (denoted by X_t), i.e. the difference between precipitation and evapotranspiration.
the process {(Yt , Xt ), t ∈ Z}, with the regime switching depending on Yt rather than
Xt , can capture the nonlinear relationships of the hydrologic system successfully.
Adopting a similar notation as for the subset SETARMA model in (6.16), a k-regime
SSTARSO model is defined as
Y_t = Σ_{i=1}^{k} ( φ_0^{(i)} + Σ_{u=1}^{p_i} φ_{j_u}^{(i)} Y_{t−j_u} + Σ_{v=0}^{q_i} ψ_{h_v}^{(i)} X_{t−h_v} + ε_t^{(i)} ) I(Y_{t−d} ∈ R^{(i)}),   (6.91)

where ε_t^{(i)} = σ_i ε_t (i = 1, . . . , k), {ε_t} ~ i.i.d. (0, 1), and R^{(i)} = (r_{i−1}, r_i] with r_0 = −∞ and r_k = ∞.
where Ti is the number of observations that belong to the ith regime, and σ T2i the
corresponding residual variance. If no prior information is used on the values of the
5
Calibration refers to the statistical consistency between the distributional forecasts and the
observations, and is a joint property of the forecasts and the observed values.
(ii) Select an interval [r, r̄] in which the thresholds are searched, or the combination of threshold values if there are more than two regimes. For instance, take the 10th percentile and the 90th percentile of the empirical distribution of {Y_t}_{t=1}^{T}, respectively.

(iii) To guarantee that there are enough observations in each regime, search r's at a fixed interval (here 1 cm) between r and r̄ such that within each ith regime T_i ≥ 20. This results in a set of, say, R (combinations of) candidate threshold values r_1, . . . , r_{k−1}.

(iv) Select candidate subsets for the non-zero coefficients φ_u^{(i)} and ψ_v^{(i)}, say subsets {s_j}, where j = 1, . . . , K denotes the jth of K subsets. Assign to these subsets the lags j_1^{(i)}, . . . , j_{p_i}^{(i)}, h_0^{(i)}, h_1^{(i)}, . . . , h_{q_i}^{(i)} of the AR terms in the output and input series in the ith regime. Given k regimes, fixed threshold values, and a fixed delay, there are S = K^k candidate SSTARSO models to represent the process {Y_t, X_t}. Below we set P_i = 3, Q_i = 2 (i = 1, 2, 3), and K = 25.
The sample standard deviations of the residuals are 7.15, 8.65, and 6.13, respectively.
Thresholds are estimated at −57 cm and −47 cm. The 95% asymptotic confidence intervals of r̂_i (i = 1, 2) are estimated from 10,000 BS replicates. The skewness
of the intervals is a result of the short distance of the threshold at −47 cm to the
upper limit of the range in which thresholds are searched; only 21 observations are
present in regime 3. Similarly, thresholds are selected more often below than above
−57 cm.
Figure 6.10: Results of SSTARSO model selection in the calibration period. Observed
water table depth (blue dots), intervals in which 95% of the simulated water table depths fall
(black dashed lines), and selected thresholds (red solid lines). From Knotters and De Gooijer
(1999).
It is interesting to note that the estimated threshold values are possibly related
to the drainage level of the trench at about −40 cm. The estimated AR–coefficient
for {Xt } in regime 3 is small as compared with those in the other two regimes
(3.01 versus 6.81, 7.69). In physical terms the value 3.01 means that, starting from
equilibrium conditions, a unit change of the precipitation excess at time t causes a
change of 3.01 units in the water table depth {Yt }. Further, note that {Xt } is the
average daily precipitation excess between t − 1 and t. A physical explanation of
the relatively small AR–coefficient for {Xt } in regime 3 may be that the fluctuation
of the water table in regime 3 is damped by the drainage to the trench. This effect
can be seen in Figure 6.10, which shows a plot of the observed water table depth in
the calibration period and the interval in which 95% of the simulated water table
depths fall, using a set of 720 BS replicates of {Yt }. Note that the graph shows a
clear seasonal behavior, with a seasonality of 24 semi-monthly time steps.
Model-validation
To compare the performance of the SSTARSO model, we employ a transfer function model with added noise (TFN). Within the present context, it consists of a functional relationship between $Y_t^F$ and a noise process $N_t^F$. Here $Y_t^F$ denotes that part of the water table depth $Y_t$ which is explained by the precipitation surplus $X_t$, and $N_t^F$ is modeled in its own right by an ARMA process. More specifically, the TFN model fitted to the data in the calibration period (minimizing BIC) is given by (6.94), with residual sample standard deviation $\hat\sigma_\varepsilon = 8.57$; asymptotic standard errors are given in parentheses.
Based on (6.91) and (6.94), we generate 1,000 series of length T = 120 and
compute the mean error (ME), the root mean squared error (RMSE) and the mean
absolute error (MAE) using data on {Yt } from the validation period.6 The values of
these measures for the SSTARSO model, and in parentheses the fitted TFN model,
are: ME = −0.3 (1.7), RMSE = 15.3 (16.3), and MAE = 12.3 (13.2). Clearly,
the fitted SSTARSO model performs better than the fitted linear TFN model. The
percentages of observations outside the interval in which 95% of the simulated water
table depths fall are 8 (SSTARSO) and 13 (TFN), respectively. Thus, the fitted
SSTARSO model provides an adequate representation. Moreover, the model can be
interpreted with respect to the hydrological conditions at the well location.
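The three validation measures are straightforward to compute. The fragment below is a minimal sketch in base R, where sims (one simulated series per row) and y_obs are hypothetical inputs.

# Hypothetical helper computing the validation measures used above, given a
# matrix sims (one simulated series per row) and the observed series y_obs.
forecast_measures <- function(sims, y_obs) {
  err <- sweep(sims, 2, y_obs)                 # simulated minus observed
  c(ME = mean(err), RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
}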
posed test statistics make an explicit correction for the effects of estimation uncertainty. Modified versions of these test statistics may also be used to check the null hypothesis of serial independence in the original series, because the effect of estimation error is irrelevant in this case. In the next chapter, we will take up the topic of testing for serial independence in time series again, this time in a nonparametric setting.
estimators is quite different from the CLS estimators based on the linear counterpart of
these models.
De Gooijer (1998) considers ML estimation of TMA models. Under some moderate conditions, Li et al. (2013) show that the estimator of the threshold parameter in a TMA model is $n$-consistent and its limiting distribution is related to a two-sided CPP, while the estimators of the other coefficients are strongly consistent and asymptotically normal.
Using the rearranged autoregressions, Coakley et al. (2003) introduce an efficient SETAR
model estimation approach which relies on the computational advantages of QR factorization
of matrices. Aase (1983) considers recursive estimation of nonlinear AR models. Zhang et al.
(2011) discuss QML estimation of a two-regime SETAR–ARCH model with the conditional
variance process depending on past time series observations. Koul and Schick (1997) propose
adaptive estimators for the SETAR(2; 1, 1) and the ExpAR(1) model with known parameter
γ, without sample splitting. These estimators have better performance (i.e. smaller MSEs)
than estimators based on the sample-splitting technique.
Hili (1993, 2001, 2003, 2008a,b) considers the minimum Hellinger distance (MHD) (see
Chapter 7) for estimating the parameters of the ExpARMA model (2.20), the simultan-
eous switching AR model, the general BL model (2.12), the SETAR(k; p, . . . , p) model, and
nonlinear dynamical systems, respectively. Under some mild conditions he establishes consistency and asymptotic normality of the resulting parameter estimates. Moreover, the MHD method is practically feasible in many areas, including nonparametric ML estimation and model selection.
The theory of asymptotically optimal estimating function for stochastic models proposed
by Godambe (1960, 1985) has been used as a framework for finite-sample nonlinear time
series estimation. Thavaneswaran and Abraham (1988) construct G estimators (named
after Godambe) for RCAR, doubly stochastic time series, and SETAR models; see also
Chandra and Taniguchi (2001). The latter authors show by simulation that G estimators outperform CLS estimators. Amano (2009) obtains similar results for NLAR, RCAR,
and GARCH models. Here, it is also appropriate to mention the generalized method of
moments (GMM) developed by Hansen (1982) which is a widely used estimation method
in econometrics. In fact, GMM estimation and Godambe’s estimation function method are
essentially the same. Caner (2002) obtains the asymptotic distribution for the least absolute
deviation estimator of the threshold parameter in a threshold regression model.
For the CLS-based estimator of the BL model in (6.35), an expression for the asymptotic variance is given by Giordano (2000) and Giordano and Vitale (2003), assuming $E(Y_t^8)<\infty$.
This condition restricts the permissible parameter space considerably. Kim and Billard
(1990) derive the asymptotic properties of the moment estimators of the parameters in a
first-order diagonal BL model extended with a linear AR(1) term. This model is also the
focus of a study by Ling et al. (2015). These authors propose a GARCH-type ML estimator
for parameter estimation which is consistent and asymptotically normal under only finite
fourth moment of the errors.
Outliers pose serious problems in time series model identification and estimation procedures. Gabr (1998) investigates the effect of additive outliers (AO) on the CLS estimation of BL models. For SETAR models, Chan and Cheung (1994) modify the class of generalized M-estimates. Their approach, however, can lead to inconsistent and very inefficient estimates of the threshold parameter even when the model is correctly specified and the errors are normally distributed (Giordani, 2006). Battaglia and Orfei (2005) propose a model-based method for detecting AO and innovational outliers (IO) in general NLAR time series processes.
Traditional likelihood analysis of threshold models is complicated because the threshold parameters can give rise to unknown shifts at arbitrary time points. On the other hand, the problem of estimating these parameters may be formulated in a Bayesian framework, applying the Gibbs sampler (Geman and Geman, 1984), an MC simulation method, to obtain posterior distributions from conditional distributions. Amendola and Francq (2009, Section 7) briefly review MCMC methods, in particular the Metropolis–Hastings algorithm (Metropolis et al. (1953) and Hastings (1970)) and the Gibbs sampler for fitting STAR models. These authors also provide tools and approaches for nonlinear time series modeling in econometrics; see the website of this book. The function metrop in the R-mcmc package, and the function MCMCmetrop1R in the R-MCMCpack package, can be used to perform a Bayesian analysis. Gibbs sampling, being a special case of the Metropolis–Hastings algorithm, is included in the R-gibbs.met package; see Robert and Casella (2004) for more information on MCMC methods.
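As a bare-bones alternative to these packages, the sketch below implements a random-walk Metropolis sampler in base R for the posterior of a single threshold $r$ in a two-regime SETAR(2; 1, 1) model. The flat prior on $r$, the common error variance, and the minimum regime size are illustrative assumptions; the AR coefficients are concentrated out by per-regime OLS.

# Random-walk Metropolis sketch for the posterior of a SETAR threshold r;
# flat prior, common error variance, and regime-size bound are assumptions.
log_post <- function(r, y) {
  x <- y[-length(y)]; z <- y[-1]
  lo <- x <= r
  if (min(sum(lo), sum(!lo)) < 10) return(-Inf)   # keep both regimes populated
  rss <- sum(resid(lm(z[lo] ~ x[lo]))^2) + sum(resid(lm(z[!lo] ~ x[!lo]))^2)
  -0.5 * length(z) * log(rss)                     # concentrated log-likelihood
}
metropolis_r <- function(y, n_iter = 5000, scale = 0.5) {
  r <- numeric(n_iter); r[1] <- median(y)
  for (i in 2:n_iter) {
    prop <- r[i - 1] + rnorm(1, sd = scale)       # random-walk proposal
    if (log(runif(1)) < log_post(prop, y) - log_post(r[i - 1], y)) {
      r[i] <- prop
    } else {
      r[i] <- r[i - 1]
    }
  }
  r                                               # draws from the posterior of r
}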
Section 6.2: Sub-section 6.2.2 is partly based on Van Casteren and De Gooijer (1997).
Using knowledge of the asymptotic properties of the CLS estimator for the SETAR model,
Wong and Li (1998) show that AICc is an asymptotically unbiased estimator for the KL
information. Kapetanios (2001) compares the small-sample performance of KL information-
based model selection criteria for Markov switching, EDTAR, and two-regime SETAR mod-
els. A similar but more extensive study is undertaken by Psaradakis et al. (2009). Hamaker (2009) investigates six information criteria for determining the number of regimes in two-regime SETAR models; for small samples AICu should be preferred. Rinke and Sibbertsen (2016) compare regime-weighted and equally weighted information criteria for simultaneous lag order and model class selection of SETAR and STAR models. Overall, in large samples, equally weighted criteria perform well.
Simonoff and Tsai (1999) derive and illustrate the AICc criterion for general regression models, including semiparametric and additive models. The MDL principle has been successfully
applied to a wide variety of model selection problems in the fields of computer science, electrical engineering, and database mining; see, e.g., Grünwald et al. (2005). Good tutorial introductions are provided by Bryant and Cordero–Braña (2000), Hansen and Yu (2001),
and Lanterman (2001). Qi and Zhang (2001) investigate the performance of AIC and BIC
in selecting ANNs.
Öhrvik and Schoier (2005) propose three bootstrap criteria for two-regime SETAR model
selection. Chen (1995) considers threshold variable selection in TARSO models. Chen et al.
(1997) propose a unified, but computationally intensive, approach for model estimation via Gibbs sampling and for selecting an appropriate (non-nested) nonlinear model; see also Chen et al. (2011a). However, the correct specification of potentially non-nested nonlinear models and/or priors is not an easy task (Koop and Potter, 2001).
Based on the superconsistency of the SETAR–CLS threshold estimate established by Chan
(1993), Strikholm and Teräsvirta (2006) provide a simple sequential method for determining
the number of thresholds using general linearity tests. In addition, they compare their
method with the approaches suggested by Gonzalo and Pitarakis (2002) (cf. Exercise 5.4(b))
and Hansen (1999).
Olteanu (2006) uses Kohonen maps and hierarchical clustering of arranged autoregressions
to determine the number of regimes in switching AR (TAR and Markov switching) models.
Bermejo et al. (2011) propose an automatic procedure to identify SETAR models and to
specify the values of thresholds. The method is based on recursive estimation of time-varying
parameters in an arranged autoregression.
Dey et al. (1994) and Holst et al. (1994) consider ML estimation via recursive EM algorithms
of switching AR(MAX) processes with a Markov regime. Krishnamurthy and Yin (2002)
study the convergence and rate of convergence issues of these algorithms; see also Douc et
al. (2014, Chapter 13 and Appendix D) on stochastic approximation EM algorithms.
Section 6.3: Li (2004, Sections 6.3 and 6.4) provides a comprehensive review of various diagnostic test statistics for ARCH and multivariate ARCH models. Li (1992) derives the asymptotic distribution of residual autocorrelations for a general NLAR model with strict WN errors. Hwang et al. (1994) extend this result to NLAR with random coefficients. Baek et al. (2012) derive the joint limit distribution of the sample residual ACF for NLAR time series models with unspecified heteroskedasticity. Based on this result they propose a test statistic which is an analogue of the test statistic $C^{(1,1)}_T(\ell)$.
An and Cheng (1991) introduce a KS-type test statistic based on the predicted residuals
obtained by the best linear predictor for a NLAR process where the noise process follows
a stationary martingale difference. The limiting distribution of the test statistic depends
on the estimates of the unknown parameters of the AR(p) model considered under the null
hypothesis. As an alternative, Kim and Lee (2002) propose a new KS test statistic and an
associated BS procedure, which outperforms the original one. Hjellvik and Tjøstheim (1995,
1996) develop a nonparametric test statistic based on the distance between the best linear
predictor and a nonlinear predictor obtained by kernel estimates of the conditional mean and
conditional variance. However, to avoid the “curse-of-dimensionality”, the conditional mean
and variance functions only depend on {Yt−i } (i = 1, . . . , p) rather than on {Yt−1 , . . . , Yt−p }.
The difficulty which then emerges is that consistency of the resulting test statistic no longer
holds. Also, Hjellvik et al. (1998) consider local polynomial estimation as a useful alternative
to kernel estimation. Deriving asymptotic properties of the resulting linearity test statistic
is, however, complicated.
An and Cheng (1991) and An et al. (2000) construct a CvM type test statistic which is
simple to compute and partly avoids the curse of dimensionality problem when p is large.
For time series generated by (6.73), Ling and Tong (2011) develop GOF test statistics that
are based on empirical processes marked by certain scores. The tests are easy to implement,
and are more powerful than other, residuals-based, test statistics.
Software References
Section 6.1.1: Tong (1983, Appendices A7 – A21) offers FORTRAN77 functions for
testing, estimation, and evaluation of SETAR models. Some of these functions are rather
dated. They are included in the interactive STAR package, to accompany the book by Tong (1990). Unfortunately, the STAR package is no longer available for sale. However, with the
consent of Howell Tong, the DOS-STAR3.2 program as an executable file (32-bit) is made
available at the website of this book. Alternatively, the R-TSA package, supporting results in
the textbook by Cryer and Chan (2008, Chapter 15), may be adopted for analyzing SETAR
models; see also the R-tsDyn package mentioned earlier in Section 2.14.
RSTAR is a package for smooth transition AR modeling and forecasting; see https:
//www.researchgate.net/publication/293486017_RSTAR_A_Package_for_Smooth_
Transition_Autoregressive_STAR_Modeling_Using_R. Alternatively, smooth transition
regression (STR) models can be specified, estimated and checked in the freely available,
and menu-driven, computer package JMulTi; see also Section 9.5. An EViews7 add-in
for STR analysis is available at http://forums.eviews.com/viewtopic.php?f=23&t=
11597&sid=e01abc77f3732bfcdebcf2bce8dd1888. Another option is the Ox-STR2 pack-
age8 (see http://www.doornik.com/download.html) based on Timo Teräsvirta’s GAUSS
code; see, also, http://people.few.eur.nl/djvandijk/nltsmef/nltsmef.htm.
Section 6.2.6: MATLAB code for comparing the performance of the various order selection
criteria discussed in this section is available at the website of this book.
Section 6.3.1: The test results in Table 6.2 are computed using a GAUSS code provided
by Yi-Ting Chen. The code is also available at the Journal of Applied Econometrics
Data Archive.
Section 6.3.2: MATLAB codes for computing the test statistics AT,K1 and HT,K2 are
available at the website of this book (file: Exercise 77b.zip).
Section 6.4: The paper by Knotters and De Gooijer (1999) contains (SS)TARSO models for time series of semi-monthly observed water table depths from six observation wells.
The application only shows (SS)TARSO results for the first well. As a companion to the
above paper, the website of this book offers FORTRAN77 codes for (SS)TARSO model
identification and estimation.
Exercises
Theory Questions
6.1 Consider the simple BL model (6.35). Given the series of observations $\{Y_t\}_{t=1}^T$, the CLS estimator $\hat\tau$ of the model parameter $\tau$ is defined by (6.39). Giordano (2000) proposes another estimator of $\tau$, defined as

$$\tilde\tau = \hat\gamma_Y(1,2)\big/\big\{\sigma^2_\varepsilon\,\mathrm{Var}(Y_t)\big\},$$

where $\hat\gamma_Y(i,j) = T^{-1}\sum_{t=1}^{T}Y_tY_{t-i}Y_{t-j}$ $(Y_t = 0,\ t<0)$ is an estimator of the third-order cumulant $E(Y_tY_{t-i}Y_{t-j})$ $(i = 1, 2)$, and $\mathrm{Var}(Y_t) = \sigma^2_\varepsilon/(1-\tau^2\sigma^2_\varepsilon)$. Assume $\sigma^2_\varepsilon$ and $\sigma^2_Y$ are known, and let $\tau^4\sigma^4_\varepsilon < 1/3$. Then show that

$$|\tilde\tau-\tau|\to0\ \text{a.s.},\quad\text{and}\quad|\tilde\tau-\tau| = O(S_T),$$
7 EViews® (Econometric Views) is a software package for Windows, used mainly for econometric time series analysis. It was developed by Quantitative Micro Software, now a part of IHS.

8 OxMetrics® is a commercial package using an object-oriented matrix programming language with a mathematical and statistical function library; published and distributed by http://www.timberlake.co.uk/software/oxmetrics.html. The downloadable Ox Console may be freely used for academic research and teaching purposes.
$$Y_t = U_{t,m} + W_{t,m},$$

where

$$U_{t,m} = \varepsilon_t + \sum_{j=1}^{m}\tau^j\Big(\prod_{\ell=1}^{j}\varepsilon_{t-\ell}\Big)\varepsilon_{t-j}, \qquad W_{t,m} = \sum_{j=m+1}^{\infty}\tau^j\Big(\prod_{\ell=1}^{j}\varepsilon_{t-\ell}\Big)\varepsilon_{t-j}, \quad (m = 1, 2, \ldots).$$
(b) Compare the ACF of the BL(0, 0, 1, 1) process with the ACF of an invertible
MA(1) process having the same innovation process as above. What do you
conclude?
(c) Show that the BL process is invertible if the condition |λ| < 0.605 holds.
(d) Given the observations $\{Y_t\}_{t=1}^T$, let $\bar U_T = T^{-1}\sum_{t=1}^{T}U_{t,m}$. Prove that, as $T\to\infty$,

$$\sqrt T(\bar U_T - \mu_U)\xrightarrow{D}N\Big(0,\ \sigma^2_\varepsilon\Big(1+\lambda^2+3\sum_{j=1}^{m}\lambda^{2j}\Big)\Big),$$

where $E(U_{t,m}) = \mu_U$.
(e) Assume $\sigma^2_\varepsilon$ is known. Kim et al. (1990) estimate the parameter $\tau$ by the method of moments. Their moment estimator $\tilde\tau$ is given by

$$\tilde\tau = \bar Y_T/\sigma^2_\varepsilon,$$

where $\bar Y_T = T^{-1}\sum_{t=1}^{T}Y_t$. Using the results in steps (a) and (c), prove that, as $T\to\infty$,

$$\sqrt T(\tilde\tau-\tau)\xrightarrow{D}N\Big(0,\ \frac{1+3\tau^2-\tau^4}{1-\tau^2}\Big).$$

[Hint: Define $Q_{m,T} = T^{-1/2}\sum_{t=1}^{T}(U_{t,m}-\mu_Y)$ and $R_{m,T} = T^{-1/2}\sum_{t=1}^{T}W_{t,m}$, with $\mu_Y = E(Y_t) = \tau\sigma^2_\varepsilon$. Then consider the asymptotic distribution of $\sqrt T(\bar Y_T - \mu_Y)$.]
(a) Show that the $(i,j)$th element of the asymptotic variance–covariance matrix $\Sigma(\ell) = I_{PQ} + A(\ell,\ell)$ in (6.77) becomes

$$\mathrm{Var}(\hat\tau)\approx\frac{1}{T\sigma^2_\varepsilon}\,\frac{1-\lambda^2}{(1-15\lambda^6)(1-3\lambda^4)}\big(183\lambda^6+42\lambda^4+14\lambda^2+1\big),$$

and

$$\mathrm{Var}(\tilde\tau)\approx\frac{1}{T}(1-\lambda^2)\Big(1+22\lambda^2+9\tau^2\sigma^2_\varepsilon-\frac{6\lambda^2}{1-\lambda^2}\Big).$$
Assume $\sigma^2_\varepsilon = 1$. Based on 1,000 MC replications, compute 95% coverage probabilities of both estimators $\hat\tau$ and $\tilde\tau$ for $T = 1{,}000$, using $\tau = \pm0.1, \pm0.4$, and $\pm0.6$. In addition, with the above specifications, compute the average length of the 95% confidence interval for both estimators. Compare and contrast the two estimators on the basis of the simulation results; a small simulation sketch is given below.
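A minimal MC sketch of the coverage computation for the moment estimator $\tilde\tau$ of part (e) follows; the burn-in length and the use of the true $\tau$ in the asymptotic variance are simplifying assumptions.

# MC coverage sketch for tau_tilde = Ybar_T / sigma_eps^2, sigma_eps^2 = 1;
# asymptotic variance (1 + 3*tau^2 - tau^4)/(1 - tau^2) is evaluated, for
# simplicity, at the true tau; the 100-observation burn-in is an assumption.
coverage_mom <- function(tau, T = 1000, n_rep = 1000) {
  hits <- replicate(n_rep, {
    n <- T + 100
    eps <- rnorm(n); y <- numeric(n)
    for (t in 2:n) y[t] <- tau * y[t - 1] * eps[t - 1] + eps[t]  # BL(0,0,1,1)
    tt <- mean(y[-(1:100)])                       # tau_tilde (sigma^2 = 1)
    se <- sqrt((1 + 3 * tau^2 - tau^4) / ((1 - tau^2) * T))
    abs(tt - tau) <= qnorm(0.975) * se
  })
  mean(hits)                                      # empirical coverage
}
# coverage_mom(0.4)   # should be close to 0.95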
6.6 Consider the BL model of Exercise 6.2. If $\sigma^2_\varepsilon$ is known, it follows from $E(Y_t) = \tau\sigma^2_\varepsilon$ that the moment estimator of $\tau$ is given by $\bar Y_T/\sigma^2_\varepsilon$. The solution of Exercise 6.2(c) contains an expression for $\sigma^2_\varepsilon$ in terms of $\gamma_Y(0)$ and $\gamma_Y(1)$. Using this expression, and assuming $\sigma^2_\varepsilon$ is unknown, Kim et al. (1990) propose the following method of moments estimator $\tilde\tau^*$ of $\tau$:

$$\tilde\tau^* = \frac{2\bar Y_T}{\{\hat\gamma_Y(0)-\hat\gamma_Y(1)\}+\{\hat\gamma^2_Y(0)-6\hat\gamma_Y(0)\hat\gamma_Y(1)-3\hat\gamma^2_Y(1)\}^{1/2}},$$

where $\hat\gamma_Y(\ell) = T^{-1}\sum_{t=1}^{T-\ell}(Y_t-\bar Y_T)(Y_{t+\ell}-\bar Y_T)$ is the lag-$\ell$ sample ACVF, with normalizing constant $T^{-1}$ instead of $(T-\ell)^{-1}$. They show that $T^{1/2}(\tilde\tau^*-\tau)$ is asymptotically normally distributed with mean zero and with a lengthy expression for the variance.
(a) Based on 1,000 MC replications, compute the mean of the moment estimator $\tilde\tau^*$ for $T = 500$ and $1{,}000$, using $\tau = \pm0.2$ and $\pm0.4$ as the parameters of the DGP. Also, compute the mean of the CLS estimator $\hat\tau$ of $\tau$.

(b) For comparison purposes, compute the bootstrap mean and standard deviation of $\tilde\tau^*$ and $\hat\tau$, using 1,000 BS replicates and with the same data sets and specifications as in part (a). Comment on the obtained simulation results.
(a) Using the R-tsDyn package, generate 100 time series of length T = 200 of this
model, with starting condition Y0 = 0. Check the local stationarity of the
LSTAR model.
(b) Compute the sample distribution of the six parameter estimates. Comment on
the outcomes.
(c) Optional: If the S-Plus FinMetrics commercial software package is available,
repeat part (a). Compare the outcomes with those obtained in part (b).
6.9 As a part of the diagnostic checking stage, it is common to check the normality
assumption. The data file Example62 res.dat contains the SETAR residuals of model
(6.15).
(a) Using the Lin–Mudholkar test statistic (1.7), test the SETAR residuals for normality.
(b) Doornik and Hansen (2008) propose an omnibus test statistic for testing univariate or multivariate normality; see, e.g., the function normality.test1 in the R-normwhn.test package. Using this test statistic, investigate the normality assumption of the SETAR residuals.

Also, perform the Doornik–Hansen test using the function normality.test2. The associated test statistic allows for time series variables which are weakly dependent rather than i.i.d. Explain the differences with the results from part (a), if any.
(c) Relatively little is known about the finite-sample performance of diagnostic test statistics applied to residuals of fitted nonlinear time series models. This question explores this issue through a small MC simulation experiment. In particular, consider the SETAR(2; 1, 1) model

$$Y_t = \begin{cases}0.3-0.5Y_{t-1}+\sigma_1\varepsilon_t & \text{if } Y_{t-1}\le0,\\ -0.1+0.5Y_{t-1}+\sigma_2\varepsilon_t & \text{if } Y_{t-1}>0,\end{cases}$$

where (i) $\sigma_1 = \sigma_2 = 1$ (homoskedastic case), and (ii) $\sigma_1 = \sqrt2$, $\sigma_2 = 1$ (heteroskedastic case), and $\{\varepsilon_t\}\stackrel{\text{i.i.d.}}{\sim}N(0,1)$.
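The sketch below simulates this SETAR(2; 1, 1) model in base R and tests the per-regime OLS residuals for normality. The Shapiro–Wilk test is an illustrative stand-in for the tests of parts (a) – (b), and the threshold (zero) is treated as known, which is a simplifying assumption.

# Simulate the SETAR(2; 1, 1) model of part (c) and test residual normality;
# Shapiro-Wilk is a stand-in assumption; threshold treated as known.
sim_setar <- function(T, s1 = 1, s2 = 1) {
  y <- numeric(T + 100)
  for (t in 2:(T + 100)) {
    y[t] <- if (y[t - 1] <= 0) 0.3 - 0.5 * y[t - 1] + s1 * rnorm(1)
            else -0.1 + 0.5 * y[t - 1] + s2 * rnorm(1)
  }
  y[-(1:100)]                                     # drop burn-in
}
resid_setar <- function(y) {                      # per-regime OLS residuals
  x <- y[-length(y)]; z <- y[-1]; lo <- x <= 0
  e <- numeric(length(z))
  e[lo]  <- resid(lm(z[lo] ~ x[lo]))
  e[!lo] <- resid(lm(z[!lo] ~ x[!lo]))
  e
}
# Empirical rejection rate under the heteroskedastic case (ii):
rej <- replicate(500,
  shapiro.test(resid_setar(sim_setar(250, s1 = sqrt(2))))$p.value < 0.05)
mean(rej)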
7 Tests for Serial Independence
Testing for randomness of a given finite time series is one of the basic problems of
statistical analysis. For instance, in many time series models the noise process is
assumed to consist of i.i.d. random variables, and this hypothesis should be testable.
Also, it is the first issue that gets raised when checking the adequacy of a fitted time series model through observed "residuals": are they approximately i.i.d., or are there significant deviations from that assumption? In fact, many inference procedures apply only to i.i.d. processes.
In Section 1.3.2, we noted that the traditional sample ACF and sample PACF are
rather limited in measuring nonlinear dependencies in strictly stationary time series
processes. As a result a wide variety of alternative dependence measures have been
proposed, often resulting in test statistics which have appealing statistical properties.
Broadly, these test statistics can be divided into two categories: those designed with
a specific nonlinear alternative in mind – such as the time-domain test statistics
discussed in Chapter 5 – and serial independence tests. When the parameters of the
fitted model are known, these latter tests are useful to detect neglected structure
in residuals. In reality, however, the model parameters are unknown. This has
motivated the development of nonparametric test statistics for serial independence.
In fact, over the past few years, enormous progress has been made in this area.
In this chapter, we consider both historic and more recent work in the area
of nonparametric serial independence tests for conditional mean models. In the
next section, we start off by expressing the null hypothesis of interest in various
forms. In Section 7.2, we introduce a number of distance measures and dependence
functionals. Jointly with a particular form of the null hypothesis, these measures
and functionals are the “backbone” for constructing the test statistics in Sections 7.3
and 7.4. Here, we distinguish between procedures for testing first-order, or single-
lag, serial dependence (two dimensions), and high-dimensional tests. Throughout
the chapter, a number of examples illustrate the performance of the proposed test
statistics on empirical data. In Section 7.5, this is complemented with an application
of high-dimensional serial independence test statistics to a famous data set.
To facilitate reading, technical details will be kept to a minimum. They are only
provided to understand the main premises underlying the construction of the test
statistics. In particular, three technical appendices are added to the chapter. In Appendix 7.A, we briefly discuss kernel-based density and regression estimation in the
simple setting of i.i.d. DGPs. Many of the nonparametric methods discussed in this
chapter are direct generalizations of this case. In Appendix 7.B, we present a general
overview of copula theory. Finally, in Appendix 7.C, we provide some information
about the theory of U- and V-statistics. These notions are often mentioned in this
chapter as useful ways to derive asymptotic theory of certain test statistics.
$$H_0:\ \{Y_t\}\stackrel{\text{i.i.d.}}{\sim}\mu, \qquad (7.1)$$

where $\mu$ is some probability measure on the real line associated with $\{Y_t, t\in\mathbb Z\}$. In practice, it will not be easy to uniquely determine dependencies in a set of observed time series data given the above setup. Rather than focusing on a single time series in $\mathbb R$, it is practical to consider a time series process in $\mathbb R^m$, which at lag $\ell$ is given by

$$H_0:\ \mu^{(1)}_m = \mu^{(2)}_m \quad (m\in\mathbb N^+). \qquad (7.2)$$

Moreover, if $\{\mathbf Y^{(\ell)}_t, t\in\mathbb Z\}$ admits a continuous distribution function $F^{(\ell)}_m(\mathbf y)$, the above hypothesis can also be formulated in terms of joint and marginal distribution functions, i.e.,

$$D_\ell(\mathbf u) = \phi_\ell(\mathbf u)-\prod_{k=1}^{m}\phi(u_k), \qquad \ell = 0,\pm1,\ldots. \qquad (7.5)$$

$$H_0:\ D_\ell(\mathbf u) = 0, \quad \forall\mathbf u\in\mathbb R^m. \qquad (7.6)$$
$$f_m(\mathbf y) = c\big(F(y_1),\ldots,F(y_m)\big)\prod_{i=1}^{m}f(y_i), \qquad (7.7)$$

$$c(\mathbf u) = \frac{\partial^m C(\mathbf u)}{\partial u_1\cdots\partial u_m} = \frac{f_m(\mathbf u)}{\prod_{i=1}^{m}f(u_i)}, \qquad \mathbf u\in[0,1]^m. \qquad (7.8)$$

$$H_0:\ c(\mathbf u) = 1. \qquad (7.9)$$
For each of the null hypotheses specified above any deviation from the corresponding
equality is evidence of serial dependence.
$$C_{m,Y}(h) = \int_{\mathbb R^m}\int_{\mathbb R^m}I(\|\mathbf y-\mathbf x\|\le h)\,d\mu_m(\mathbf y)\,d\mu_m(\mathbf x), \qquad (7.10)$$

2 Within the information theoretic literature the symbol $\epsilon$ is often used for the bandwidth, also called tolerance distance or cut-off threshold.
Figure 7.1: Three kernel functions (left panel) and their associated FTs (right panel):
Gaussian (black solid line), squared Cauchy (blue medium dashed line), and uniform (red
dotted line).
$$\Delta^Q(m) = \|\mu^{(1)}_m-\mu^{(2)}_m\|^2 = (\mu^{(1)}_m,\mu^{(1)}_m)-2(\mu^{(1)}_m,\mu^{(2)}_m)+(\mu^{(2)}_m,\mu^{(2)}_m), \qquad (7.13)$$

where

$$(\mu^{(i)}_m,\mu^{(j)}_m) = \int_{\mathbb R^m}\int_{\mathbb R^m}K_h(\mathbf y-\mathbf x)\,d\mu^{(i)}_m(\mathbf y)\,d\mu^{(j)}_m(\mathbf x), \qquad (i,j = 1,2).$$

It is easily seen that the functional $(\mu^{(1)},\mu^{(1)})-(\mu^{(2)},\mu^{(2)})$ with the 'naive' or identity kernel function $K_h(z) = I(|z|<h)$ corresponds to (7.11).

Because FTs leave the $L_2$-norm invariant by Parseval's identity (loosely speaking, the sum or integral of the square of a function is equal to the sum or integral of the square of its FT), we can express (7.13) as

$$\Delta^Q(m) = \int_{\mathbb R^m}\int_{\mathbb R^m}K_h(\mathbf y-\mathbf x)\,d(\mu^{(1)}_m-\mu^{(2)}_m)(\mathbf y)\,d(\mu^{(1)}_m-\mu^{(2)}_m)(\mathbf x)$$
Figure 7.2: Distance $\Delta^Q(2)$ between a bivariate standard normal distribution and a correlated bivariate normal distribution with correlation coefficient $\rho$, for different values of $h$.

$$\Delta^Q(2) = \frac{1}{4\pi}\Big\{\frac{1}{\sqrt{(h^2+1)^2-\rho^2}}-\frac{2}{\sqrt{(h^2+1)^2-\rho^2/4}}+\frac{1}{h^2+1}\Big\}. \qquad (7.15)$$

Figure 7.2 shows $\Delta^Q(2)$ for bandwidths $h = 0.2, 0.3, 0.5$, and $1.0$ as a function of $|\rho|$; a short numerical check follows below. Note from (7.15) that, as $h\to0$, the limiting squared distance function is well-defined, which need not be the case for other combinations of kernel functions and pdfs.
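The following base R fragment evaluates (7.15) and reproduces the qualitative pattern of Figure 7.2; the plotting details are illustrative assumptions.

# Numerical check of (7.15): Delta^Q(2) as a function of |rho| for the four
# bandwidths used in Figure 7.2; plotting choices are assumptions.
deltaQ2 <- function(rho, h) {
  a <- (h^2 + 1)^2
  (1 / (4 * pi)) * (1 / sqrt(a - rho^2) - 2 / sqrt(a - rho^2 / 4) + 1 / (h^2 + 1))
}
rho  <- seq(0, 0.99, by = 0.01)
vals <- sapply(c(0.2, 0.3, 0.5, 1.0), function(h) deltaQ2(rho, h))
matplot(rho, vals, type = "l", xlab = "|rho|", ylab = "Delta^Q(2)")
# deltaQ2(0, h) = 0 for every h, as required under independence.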
where $B(\cdot,\cdot,\cdot)$ is a real-valued function, and the integrals are taken over the support, say $S^2$, of $(Y_t, Y_{t-\ell})'$.

Several functionals have been proposed in the information theory literature. Roughly, the resulting measures can be classified in four major categories:

• Generalized Kolmogorov (K) divergence measure

$$\Delta^K_q(\ell) = \Big\{\int_{S^2}\big|f_\ell(x,y)-f(x)f(y)\big|^q\,dx\,dy\Big\}^{1/q}, \qquad (q>0),$$

which for $q = 1$ is the $L_1$-norm. $\Delta^K_q(\cdot)$ satisfies properties (i) – (ii), but not (iii).
• Csiszár (C) (1967) divergence measure

$$\Delta^C(\ell) = \int_{S^2}\phi\Big(\frac{f_\ell(x,y)}{f(x)f(y)}\Big)f(x,y)\,dx\,dy,$$

where $\phi(\cdot)$ is some strictly convex function on $[0,\infty)$. Thus $B\{z_1,z_2,z_3\}\equiv\phi(z_1/z_2z_3)$.

• Rényi (R) (1961) divergence measure

$$\Delta^R_q(\ell) = \frac{1}{q-1}\log\int_{S^2}\{f_\ell(x,y)\}^q\{f(x)f(y)\}^{1-q}\,dx\,dy, \qquad (0<q<1).$$
It is easy to see that the Hellinger distance is symmetric, and hence it can serve as a distance measure, contrary to other divergences.³ In addition, various relations exist between the divergence measures. For instance, Rényi's information divergence follows from Csiszár's measure by taking $\phi(u) = \mathrm{sign}(u-1)u^q$ $(u\ge0;\ q\ne1)$, which yields $\Delta^R_q(\cdot) = (q-1)^{-1}\log|\Delta^C_q(\cdot)|$. The connection between Rényi's measure and Tsallis' measure is given by $\Delta^R_q(\cdot) = (q-1)^{-1}\log[1+(q-1)\Delta^T_q(\cdot)]$. Clearly, when $\phi(\cdot)$ is taken as the logarithmic function, Csiszár's measure is equivalent to the KL information measure $I^{KL}(\cdot)$ defined in (1.18). Moreover, $I^{KL}(\cdot)\equiv\Delta^T_1(\cdot)$ and $\Delta^T_{1/2}(\cdot)\equiv\Delta^H(\cdot)$.
$$\Delta^{CR}_q(\ell) = \frac{2}{q+1}\bigg[F(x)F(y)\Big(\frac{F(x)F(y)}{F_\ell(x,y)}\Big)^q+\{1-F(x)F(y)\}\Big(\frac{1-F(x)F(y)}{1-F_\ell(x,y)}\Big)^q-1\bigg].$$

The Cressie–Read measure and Rényi's divergence measure are related:

$$\Delta^{CR}_q(\ell) = \frac{2}{q+1}\bigg[\exp\Big\{q\,\Delta^R_{q+1}\Big(\frac{F(x)F(y)}{F_\ell(x,y)}\Big)\Big\}+\exp\Big\{q\,\Delta^R_{q+1}\Big(\frac{1-F(x)F(y)}{1-F_\ell(x,y)}\Big)\Big\}-1\bigg].$$
counterparts, the CvM–GOF test statistic (4.38) can be obtained. Another well-known functional follows from setting $q = 1$ and $w_\ell(x,y) = F_\ell(x,y)(1-F_\ell(x,y))$ in $C^{\max}_q(\cdot)$, i.e.,

$$\Delta^{KS}(\ell) = \sup_{S^2}\big|F_\ell(x,y)-F(x)F(y)\big|^2,$$

where $\Delta^{KS}(\cdot)$ is the Kolmogorov–Smirnov (KS) divergence measure. This measure satisfies the basic properties (i) – (iii). Setting $q = 1$ and $w_\ell(x,y) = dF_\ell(x,y)$ in $C_q(\cdot)$ generates the Anderson–Darling (AD) functional

$$\Delta^{AD}(\ell) = \int_{S^2}\big\{F(x)F(y)-F_\ell(x,y)\big\}^2\big[F_\ell(x,y)\{1-F_\ell(x,y)\}\big]^{-1}\,dF_\ell(x,y),$$

which, after evaluating the integral and some algebra, leads to (4.39).
All the above measures consider the distance between two-dimensional densities or two-dimensional distribution functions at a single lag $\ell$. However, for testing $H^{(\ell)}_0: f_\ell(Y_t,Y_{t-\ell}) = f(Y_t)f(Y_{t-\ell})$, it is possible that two different lags may give conflicting conclusions. It is thus desirable to have a multiple-lag testing procedure. One simple procedure is to form $M$ linear combinations of single-lag two-dimensional test functionals $\Delta(\ell)$, i.e.

$$Q(M) = \frac{1}{\sqrt M}\sum_{\ell=1}^{M}\Delta(\ell), \qquad (M\in\mathbb N^+), \qquad (7.19)$$
Test statistics derived from (7.19) are portmanteau-type tests. Alternatively, one may use the Bonferroni correction procedure, based on the p-values of the individual single-lag serial correlation test statistics. Notice, however, that pairwise (serial) independence for all combinations of paired random variables does not imply joint (serial) independence in general. Hence, methods for the detection of serial dependence in $m>2$ dimensions are needed; see Section 7.4.
Recall that Tsallis’ divergence satisfies (i) – (iii). In line with its definition in
Section 7.2.3, it is easy to see that an m-dimensional copula-based (denoted by the
superscript c) version of ΔTq (·) is defined as
⎧ 1 1−q
⎪ 1
⎪
⎪ 1 − c(u)du (q = 1),
⎨ 1 − q [0,1]m c(u)
ΔT,c
m,q () = (7.21)
⎪
⎪
⎪
⎩ c(u) log[c(u)]du (q = 1),
[0,1]m
()
where c(u) is the copula density of {Yt , t ∈ Z}. It can be shown that ΔT,cm,q () ≥ 0
T,c ()
and Δm,q () = 0 if and only if the process {Yt , t ∈ Z} is serially independent.
m
Equivalently, ΔT,c
m,q (C) = 0 if and only if C(u) = Π (u), where Π (u) ≡ i=1 ui being
the independence copula (m ≥ 2).
Other $m$-variate copula-based measures can be obtained in a similar manner as we previously applied to introduce the four major density-based measures as special cases of the general functional (7.16). In particular, in terms of the $m$-dimensional copula density, we have

$$\Delta^c_m(\ell) = \int_{[0,1]^m}B\{c(\mathbf u),1,\ldots,1\}\,d\mathbf u = \int_{[0,1]^m}B^c\{c(\mathbf u)\}c(\mathbf u)\,d\mathbf u \qquad (7.22)$$
$$\hat f(y) = \frac{1}{T}\sum_{t=1}^{T}K_h(y;Y_t), \qquad (7.23)$$

where $K_h(y;Y_t) = (\sqrt{2\pi}\,h)^{-1}\exp\{-(y-Y_t)^2/2h^2\}$ with $h>0$ the bandwidth. Similarly, the Gaussian product kernel density is often used for estimating the bivariate density function $f_\ell(\cdot,\cdot)$, i.e.,

$$\hat f_\ell(x,y) = \frac{1}{T-\ell}\sum_{t=1}^{T-\ell}K_h(x;Y_t)K_h(y;Y_{t+\ell}). \qquad (7.24)$$
$$CV(h) = \frac{1}{T}\sum_{t=1}^{T}\log\Big\{\frac{1}{T-1}\sum_{s=1}^{T}K_h(Y_t;Y_s)I(s\ne t)\Big\}, \qquad (7.25)$$

where the term in curly brackets represents the kernel-based "leave-one-out" density estimator.⁴ This produces a density estimate which is "close" to the true density in terms of the KL information divergence.

4 As an aside, note that the local marginal density is usually not the main object of interest in a testing context.
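In base R, criterion (7.25) with the Gaussian kernel of (7.23) can be sketched as follows; the grid search over $h$ is an illustrative choice rather than part of the criterion.

# Likelihood cross-validation criterion (7.25) with a Gaussian kernel;
# the grid for h is an assumption.
cv_loglik <- function(h, y) {
  T <- length(y)
  K <- outer(y, y, function(a, b) dnorm(a - b, sd = h))  # K_h(Y_t; Y_s)
  diag(K) <- 0                                           # leave-one-out: s != t
  mean(log(rowSums(K) / (T - 1)))
}
# h_grid <- seq(0.05, 2, by = 0.05)
# h_cv   <- h_grid[which.max(sapply(h_grid, cv_loglik, y = y))]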
The boundedness of the support set $S$ of $(Y_t,Y_{t-\ell})'$ in the nonparametric entropy-based divergence measures $\Delta^{CR}_q(\cdot)$ and $\Delta^T_1(\cdot)$ is a key assumption to establish the asymptotic distribution theory of the resulting test statistics. Gaussian kernel estimation suffers from so-called boundary effects, with parts of the window devoid of data. Such an effect can be diminished by, for instance, modifying the divergence measures with a trimming function $w(x,y) = I\{(x,y)\in C\}$ which selects only a compact set $C\subseteq S = S_X\times S_Y$. Two simple trimming functions, adopted by Fernandes and Néri (2010) and Bagnato et al. (2014), are based respectively on the compact sets

$$C^u_1 = \{u:\ |u-\bar u|\le2\hat\sigma_u\} \quad\text{and}\quad C^u_2 = \{u:\ \hat\xi_{0.1}(u)\le u\le\hat\xi_{0.9}(u)\},$$

where $\bar u$ and $\hat\sigma_u$ denote the sample mean and sample standard deviation, while $\hat\xi_q(\cdot)$, $q\in(0,1)$, denotes the $q$-quantile of the empirical distribution. In addition, the boundary effect can be corrected by using special boundary kernel density estimators. Another widely-known way of nonparametric density estimation is to use histogram methods. In the next section we discuss the histogram estimator within the framework of high-dimensional copula estimation.
$$\hat F_{i,T}(y) = \frac{1}{T+1}\sum_{t=1}^{T}I(Y_{i,t}\le y), \qquad \forall y\in\mathbb R. \qquad (7.26)$$

Next, the estimated marginal distribution functions are used to obtain the so-called pseudo-observations, or PITs, $\hat{\mathbf U}_t = (\hat U_{1,t},\ldots,\hat U_{m,t})'$ with $\hat U_{i,t} = \hat F_{i,T}(Y_{i,t})$. Note, residuals are just a special case of pseudo-observations. Finally, an estimator of $C(\mathbf u)$, called the empirical copula, is defined as

$$\hat C_T(\mathbf u) = \frac{1}{T}\sum_{t=1}^{T}I(\hat{\mathbf U}_t\le\mathbf u), \qquad \mathbf u\in[0,1]^m. \qquad (7.27)$$

$$(T+1)\hat F_{i,T}(Y_{i,t})\equiv R_{i,t} = \sum_{s=1}^{T}I(Y_{i,s}\le Y_{i,t}), \qquad (1\le i\le m;\ 1\le t\le T).$$

Hence, any rank test of serial independence is a function of $\hat C_T(\mathbf u)$. Due to the invariance property of the ranks, the empirical copula is invariant under strictly monotonic increasing transformations of the margins; a short sketch of (7.26) – (7.27) follows below.
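A compact base R sketch of (7.26) – (7.27) for the serial case ($m = 2$, lag one) is given below; the helper names are assumptions.

# Pseudo-observations via ranks, and the empirical copula at a point u.
pseudo_obs <- function(y) rank(y) / (length(y) + 1)       # U_hat_{i,t}
emp_copula <- function(U, u) mean(apply(t(U) <= u, 2, all))
# y <- rnorm(200)
# U <- cbind(pseudo_obs(y)[-1], pseudo_obs(y)[-length(y)])  # (Y_t, Y_{t-1})
# emp_copula(U, c(0.5, 0.5))  # close to 0.25 under serial independence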
and

$$\hat\Delta^{T,c}_1(\ell) = \frac{1}{T}\sum_{q=1}^{Q}N_q\log\Big(\frac{N_q}{Th^m_b}\Big) = \frac{1}{T}\sum_{q=1}^{Q}N_q\log N_q-\log(Th^m_b) \qquad (7.29)$$

is a copula-based estimator of $\Delta^T_1(\cdot)$. The optimal value of $h_b$, minimizing the mean squared error, is of order $O(T^{-1/(2+m)})$; cf. Silverman (1986).

• Apart from $\hat\Delta^{C_T}_{T,1}(\cdot)$, the tests have an asymptotic normal distribution under the null hypothesis of pairwise serial independence. Under weak regularity conditions, it can be shown (see the cited references) that all tests are consistent against lag one dependent alternatives. No limiting distribution theory is available for $\hat\Delta^{C_T}_{T,1}(\cdot)$, which has hindered its application in practice.
Skaug and Tjøstheim (1993a) (functional $\Delta^H$):

$$\hat\Delta^{ST_1}_T(\ell) = \frac{2}{T-\ell}\sum_{t\in S_T(\ell)}\Big[1-\Big\{\frac{\hat f_\ell(Y_t,Y_{t-\ell})}{\hat f(Y_t)\hat f(Y_{t-\ell})}\Big\}^{1/2}\Big]w_t(\ell)$$

Skaug and Tjøstheim (1996) (functional $\Delta^*$):

$$\hat\Delta^{ST_2}_T(\ell) = \frac{1}{T-\ell}\sum_{t\in S_T(\ell)}\big\{\hat f_\ell(Y_t,Y_{t-\ell})-\hat f(Y_t)\hat f(Y_{t-\ell})\big\}w_t(\ell)$$

Granger and Lin (1994) (functional $1-e^{-2I^{KL}}$):

$$\hat\Delta^{GL}_T(\ell) = 1-\exp\Big[-\frac{2}{T-\ell}\sum_{t\in S_T(\ell)}\log\Big\{\frac{\hat f_\ell(Y_t,Y_{t-\ell})}{\hat f(Y_t)\hat f(Y_{t-\ell})}\Big\}\Big]$$

Fernandes and Néri (2010) (functional $\Delta^T$, for $q\in\{\tfrac12,1,2,4\}$):

$$\hat\Delta^{FN}_{T,q}(\ell) = \frac{1}{(1-q)(T-\ell)}\sum_{t\in S_T(\ell)}\Big[1-\Big\{\frac{\hat f(Y_t)\hat f(Y_{t-\ell})}{\hat f_\ell(Y_t,Y_{t-\ell})}\Big\}^{1-q}\Big]w_t(\ell)$$

Distribution functions:

$$\hat\Delta^{ST_3}_T(\ell) = \frac{1}{T-\ell}\sum_{t\in S_T(\ell)}\big\{\hat F_{\ell,T}(Y_t,Y_{t-\ell})-\hat F_T(Y_t)\hat F_T(Y_{t-\ell})\big\}^2$$

• The trimming function $w_t(\ell)$ is generally not needed for $\hat\Delta^{ST_1}_T(\cdot)$ and $\hat\Delta^{ST_2}_T(\cdot)$. For i.i.d. data from the uniform distribution, $w_t(\ell)$ is needed to prevent degeneracy, because otherwise the asymptotic variance of the test statistics would vanish.

• The test statistic $\hat\Delta^{ST_3}_T(\cdot)$ utilizes the following unbiased estimators of the one- and two-dimensional EDF of $\{Y_t\}_{t=1}^T$, respectively,

$$\hat F_T(y) = \frac{1}{T}\sum_{t=1}^{T}I(Y_t\le y), \qquad \hat F_{\ell,T}(x,y) = \frac{1}{T-\ell}\sum_{t=1}^{T-\ell}I(Y_t\le x)I(Y_{t+\ell}\le y).$$
Observe that all test statistics in Table 7.1 have an equivalent integral representation. Also, using the copula-based measure (7.21) in conjunction with the copula density estimator above, multiple-lag test statistics can be constructed as
$$Q(M) = \frac{1}{\sqrt M}\sum_{\ell=1}^{M}\hat\Delta(\ell), \qquad (M\in\mathbb N^+), \qquad (7.30)$$
where, except for the test statistic proposed by Chan and Tran (1992), $\hat\Delta(\cdot)$ can be one of the single-lag test statistics listed in Table 7.1. Hong and White (2005) consider (7.30) with $\hat\Delta(\cdot)$ replaced by $\hat\Delta^{HW}(\cdot)$, $\hat R(\cdot)$ (see Table 7.1, footnote (4)), and $\hat\Delta^{ST_2}_T(\cdot)$. In each case the resulting portmanteau-type test statistic has an asymptotic normal null distribution. Bagnato et al. (2014) only focus on the integrated Gaussian kernel estimator of $\Delta^T_1(\cdot)$. These authors conclude that, as opposed to a simultaneous test based on the Bonferroni procedure, the portmanteau-type test statistic is the best choice since it preserves size across lags.
Using the CvM functional, Hong (1998) considers a modified version of the portmanteau-type pairwise serial independence test statistic of Skaug and Tjøstheim (1993b). That is,

$$\hat Q_{H_1}(M) = \sum_{\ell=1}^{M}(T-\ell)\hat\Delta^{ST_3}_T(\ell). \qquad (7.31)$$

Thus, similar to the well-known LB portmanteau-type test statistic for joint significance of the first $M$ serial autocorrelation coefficients, the test statistics $\hat\Delta^{ST_3}_T(\ell)$ $(\ell = 1,\ldots,M)$ are weighted. A sensible generalization of (7.31) is to include a symmetric continuous window kernel $\lambda(\cdot)$ with $\lambda(0) = 1$. This ensures that the asymptotic bias of the test statistic vanishes.
with the Daniell lag window $\lambda(u) = \sin(\pi u)/\pi u$, which is optimal over a class of window kernels that includes the Parzen window; see (4.18). Based on the theory of degenerate V-statistics, it can be shown that (7.32) has a limiting $N(0,1)$ distribution under the null hypothesis of serial independence. A simple way to obtain p-values is via the smoothed BS or permutation method; see Section 7.3.6 for details.
with residual variance $\hat\sigma^2_\varepsilon = 0.2765$. The sample residual ACF shows significant (5% level) values at lags $\ell = 3, 4, 6, 7, 9$, and 10. Clearly, it is likely that the fitted model is not appropriate. To investigate this in more detail, we consider $\hat\Delta^{ST_2}_T(\ell)$ and a standardized version of this test statistic, namely $\hat J_T(\ell)$, which is asymptotically $N(0,1)$ as $T\to\infty$. For the Gaussian kernel density estimators, we obtain the bandwidth $h$ through a data-driven bandwidth method; see, e.g., Hong and White (2005, p. 859) and Bagnato et al. (2014).

Based on 1,000 bootstrap replicates, both test statistics $\hat\Delta^{ST_2}_T(\ell)$ and $\hat J_T(\ell)$ have nearly zero p-values for all lags from 1 to 10. Moreover, the multiple-lag portmanteau-type test statistics have p-values less than 0.05 for $M = 2, 4, 6$, and 8. All these test results indicate that the residuals are not serially independent, suggesting that the fitted BL model is far from adequate.
Thus, under the null hypothesis of serial independence $F_Y(\omega) = \omega$, which is analogous to a flat spectrum. Flat spectra, however, can result from nonlinear processes which would be accepted as WN by a test statistic based on (7.33) with high probability. For example, the BL process $Y_t = \beta\varepsilon_{t-1}\varepsilon_{t-2}+\varepsilon_t$, where $\{\varepsilon_t\}\sim\mathrm{WN}(0,\sigma^2_\varepsilon)$, has $\gamma_Y(\ell) = 0$ for $\ell>0$; hence estimates of the spectrum will be constant over all frequencies $\omega$.
As an alternative, Hong (2000) introduces two test statistics (denoted by the superscripts $H_1$ and $H_2$) for pairwise serial independence using a generalized spectrum. The key idea of the generalized spectrum is to transform $\{Y_t, t\in\mathbb Z\}$ via a complex-valued exponential function

$$Y_t\longrightarrow\exp(iuY_t), \qquad u\in\mathbb R,$$

and then consider the spectrum of the transformed process. Specifically, let $\phi(u_1) = E\{\exp(iu_1Y_t)\}$ be the marginal characteristic function of the process $\{Y_t, t\in\mathbb Z\}$, and let $\phi_\ell(u_1,u_2) = E\{\exp(i(u_1Y_t+u_2Y_{t-|\ell|}))\}$ $(\ell = 0,\pm1,\ldots)$ be the pairwise joint characteristic function of $\{(Y_t, Y_{t-|\ell|})\}$. Then the lag-$\ell$ ACVF of the transformed process is given by

$$\gamma_{u_1,u_2}(\ell)\equiv\mathrm{Cov}\big(e^{iu_1Y_t},e^{iu_2Y_{t-|\ell|}}\big) = \phi_\ell(u_1,u_2)-\prod_{k=1}^{2}\phi(u_k)\equiv D_\ell(u_1,u_2), \qquad (7.34)$$

where $D_\ell(\cdot,\cdot)$ is defined by (7.5). If $\gamma_{u_1,u_2}(\ell) = 0$ $\forall(u_1,u_2)\in\mathbb R^2$, then there is no serial dependence between $Y_t$ and $Y_{t-|\ell|}$; otherwise there is. In other words, the null hypothesis of interest is given by (7.6) with $m = 2$.
Now, suppose that $\sup_{(u_1,u_2)\in\mathbb R^2}\sum_{\ell=-\infty}^{\infty}|\gamma_{u_1,u_2}(\ell)|<\infty$, which holds under a proper mixing condition. Then the FT of $\gamma_{u_1,u_2}(\ell)$,

$$f_Y(\omega,u_1,u_2) = \sum_{\ell=-\infty}^{\infty}\gamma_{u_1,u_2}(\ell)\exp(-2\pi i\omega\ell), \qquad \omega\in[0,1], \qquad (7.35)$$

exists. Because $-\partial^2f_Y(\omega,u_1,u_2)/\partial u_1\partial u_2|_{(0,0)} = f_Y(\omega)$, (7.35) is called a generalized spectral density of $\{Y_t, t\in\mathbb Z\}$, although it does not have the mathematical properties of a pdf. Similarly, a generalization of (7.33) is given by

$$F_Y(\omega,u_1,u_2) = \gamma_{u_1,u_2}(0)\omega+2\sum_{\ell=1}^{\infty}\gamma_{u_1,u_2}(\ell)\frac{\sin(\ell\pi\omega)}{\ell\pi}, \qquad \omega\in[0,1], \qquad (7.36)$$
$$\hat F_T(\omega,x,y) = \hat\gamma_{x,y}(0)\omega+2\sum_{\ell=1}^{T-1}\Big(1-\frac{\ell}{T}\Big)^{1/2}\hat\gamma_{x,y}(\ell)\frac{\sin(\ell\pi\omega)}{\ell\pi}, \qquad (7.37)$$

where

$$\hat\gamma_{x,y}(\ell) = \hat F_{\ell,T}(x,y)-\hat F_T(x,\infty)\hat F_T(\infty,y), \qquad (\ell = 1,\ldots,T-1),$$

with

$$\hat F_{\ell,T}(x,y) = \frac{1}{T-\ell}\sum_{t=1}^{T-\ell}I(Y_t\le x)I(Y_{t+\ell}\le y).$$

The factor $(1-\ell/T)^{1/2}$ in (7.37) is a small-sample correction for weighting down higher-order lags $\ell$.
Utilizing the CvM functional, the "summed version" of a test statistic for pairwise serial independence is given by

$$\hat\Delta^{H_1}_{F_Y} = \sum_{\ell=1}^{T-1}\frac{T-\ell}{(\ell\pi)^2}\,\frac{1}{T^2}\sum_{t=1}^{T}\sum_{s=1}^{T}\big\{\hat\gamma_{Y_t,Y_s}(\ell)\big\}^2. \qquad (7.38)$$
Note that both test statistics do not assume that the lag order M is known a priori.
This may be appealing, since for certain DGPs it is not obvious how to choose the
optimal lag order leading to the highest power of a particular serial independence
test statistic.
Under $H_0$, and assuming that the stationary process $\{Y_t, t\in\mathbb Z\}$ has a continuous marginal distribution function $F_Y(\cdot)$, it can be shown (Hong, 2000) that the test statistics (7.38) and (7.39) are asymptotically distributed as, respectively,

$$\hat\Delta^{H_1}_{F_Y}\xrightarrow{D}\sum_{i,j,l=1}^{\infty}\frac{1}{(i\pi)^2(j\pi)^2(l\pi)^2}Z^2_{ijl} \qquad (7.40)$$

and

$$\hat\Delta^{H_2}_{F_Y}\xrightarrow{D}\sup_{(\omega_1,\omega_2,\omega_3)\in[0,1]^3}\Big|\sum_{i,j,l=1}^{\infty}\frac{\sqrt2\sin(i\pi\omega_1)\,\sqrt2\sin(j\pi\omega_2)\,\sqrt2\sin(l\pi\omega_3)}{(i\pi)^2(j\pi)^2(l\pi)^2}Z_{ijl}\Big|, \qquad (7.41)$$

where $\{Z_{ijl};\ i,j,l\ge1\}$ are i.i.d. $N(0,1)$ random variables. Both test statistics enjoy the nuisance-parameter-free property, which ensures that their critical values and/or p-values can be obtained by directly simulating $\hat\Delta^{H_1}_{F_Y}$ and $\hat\Delta^{H_2}_{F_Y}$.
(iii) Repeat step (ii) $B$ times, to obtain $\{\hat\Delta^{*,(b)}(\ell)\}_{b=1}^{B}$.

(iv) Calculate the one-sided bootstrap p-value

$$\hat p(\ell) = \frac{1+\sum_{b=1}^{B}I\big(\hat\Delta^{*,(b)}(\ell)\ge\hat\Delta^{(0)}(\ell)\big)}{1+B}.$$
Permutation
When testing a composite hypothesis, an exact-level MC test statistic can be obtained by conditioning on an observed value of a minimal sufficient statistic under the null hypothesis (Engen and Lillegård, 1997). By definition, the resulting distribution does not depend on unknown parameters, so that it can be used to simulate data that have the same (exact) conditional distribution as the DGP under the null hypothesis, given the sufficient statistic. Under the null hypothesis of pairwise serial independence, the order statistics provide a minimal and sufficient statistic. To be specific, let $\hat\Delta^{(0)}(\cdot)$ denote the value of the dependence functional conditioned on the original data, and let $\{\hat\Delta^{(i)}(\cdot)\}_{i=1}^{B}$ be the set of "bootstrapped" test statistics obtained from a random permutation of the original data. Then calculate the one-sided p-value as

$$\hat p(\cdot) = \frac{1+\sum_{i=1}^{B}I\big(\hat\Delta^{(i)}(\cdot)\ge\hat\Delta^{(0)}(\cdot)\big)}{1+B}. \qquad (7.42)$$
Thus, reject the null hypothesis of pairwise serial independence if p(·) < α, where α
is some pre-specified nominal significance level.
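The permutation scheme of (7.42) is easy to code. The base R sketch below uses, purely as an illustrative assumption, the squared lag-$\ell$ autocorrelation as the dependence functional $\hat\Delta$.

# Generic permutation p-value in the spirit of (7.42); the squared lag-ell
# autocorrelation is an assumed, illustrative choice of dependence functional.
perm_pvalue <- function(y, ell = 1, B = 999) {
  Delta <- function(x) cor(x[-(1:ell)], x[1:(length(x) - ell)])^2
  d0 <- Delta(y)
  db <- replicate(B, Delta(sample(y)))   # random permutations of the data
  (1 + sum(db >= d0)) / (1 + B)
}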
For multiple-lag tests, Diks and Panchenko (2007) advocate the following algorithm.
(ii) Randomly permute the data $B$ times, and build the $B\times M$ matrix $\mathbf B$ whose $(b,\ell)$th element is $\hat\Delta^{(b)}(\ell)$ $(b = 1,\ldots,B;\ \ell = 1,\ldots,M)$. Then assemble $\hat{\boldsymbol\Delta}^{(0)}$ and $\mathbf B$ into the $(B+1)\times M$ matrix

$$\tilde{\mathbf B} = \begin{pmatrix}\hat{\boldsymbol\Delta}^{(0)}\\ \mathbf B\end{pmatrix}.$$

(iii) Compute the $(B+1)\times M$ matrix $\mathbf P$ of p-values

$$\hat p_i(\ell) = \frac{1+\sum_{k=0,k\ne i}^{B}I\big(\hat\Delta^{(k)}(\ell)>\hat\Delta^{(i)}(\ell)\big)}{1+B}, \qquad (i = 0,\ldots,B;\ \ell = 1,\ldots,M).$$

(iv) For each row of $\mathbf P$ select the smallest $\hat p_i(\ell)$ and call it $\tilde T_i$, i.e. $\tilde T_i = \min_{1\le\ell\le M}\hat p_i(\ell)$.

(v) Adopt $\tilde T$, say, as a test statistic. Interpret $\tilde T_0$ as its observed value and the set $\{\tilde T_1,\ldots,\tilde T_B\}$ as the values associated with each permutation. Then calculate an "overall" p-value of $\tilde T$, i.e.

$$\hat p = \frac{1+\sum_{i=1}^{B}I(\tilde T_i\le\tilde T_0)}{1+B}.$$
where $N = T-m+1$ is the number of vectors obtained from a time series $\{Y_t\}_{t=1}^T$. Now, given the divergence measure $C_{m,Y}(h)-\{C_{1,Y}(h)\}^m$, a test statistic for serial independence in $\{Y_t\}_{t=1}^T$ is defined as

$$S_{m,Y}(h) = \sqrt N\,\frac{\hat C_{m,Y}(h)-\{\hat C_{1,Y}(h)\}^m}{\hat\sigma_{m,Y}(h)}, \qquad (7.44)$$

where $\hat\sigma^2_{m,Y}(h)$ is a consistent estimator of the variance of $\sqrt N\big[\hat C_{m,Y}(h)-\{\hat C_{1,Y}(h)\}^m\big]$. The specific estimator proposed by Brock et al. (1996) is

$$\hat\sigma^2_{m,Y}(h) = 4\Big[m(m-2)\hat C^{2m-2}_{m,Y}(\hat K_{m,Y}-\hat C^2_{m,Y})+\hat K^m_{m,Y}-\hat C^{2m}_{m,Y}+2\sum_{j=1}^{m-1}\big\{\hat C^{2j}_{m,Y}(\hat K^{m-j}_{m,Y}-\hat C^{2m-2j}_{m,Y})-m\hat C^{2m-2}_{m,Y}(\hat K_{m,Y}-\hat C^2_{m,Y})\big\}\Big], \qquad (7.45)$$

where

$$\hat K_{m,Y} = \frac{2}{N(N-1)(N-2)}\sum_{i=1}^{N-2}\sum_{s=i+1}^{N-1}\sum_{t=s+1}^{N}I(\|\mathbf Y_i-\mathbf Y_s\|<h)I(\|\mathbf Y_s-\mathbf Y_t\|<h),$$

and where the dependence of the terms in (7.45) on $T$ and $h$ has been suppressed for notational clarity. Under the null hypothesis of serial independence, and by exploiting the asymptotic theory for U-statistics, it can be shown that, as $T\to\infty$,

$$S_{m,Y}(h)\xrightarrow{D}N(0,1), \qquad \forall h\in(0,\infty). \qquad (7.46)$$
The test statistic (7.44) is stated in terms of the data series $\{Y_t\}_{t=1}^T$. Brock et al. (1996) show that the limiting behavior of $S_{m,Y}(h)$, under $H_0$ of no serial dependence, remains the same whether the model parameters are known or estimated in a root-$n$ consistent fashion. Thus, (7.44) can be adapted to test situations involving "residuals" $\{e_t\}_{t=1}^T$. The resulting diagnostic test, called BDS test statistic after its three originators Brock, Dechert, and Scheinkman, is defined as

$$S_{m,e}(h) = \sqrt T\,\frac{\hat C_{m,e}(h)-\{\hat C_{1,e}(h)\}^m}{\hat\sigma_{m,e}(h)}, \qquad (7.47)$$

where in this case the sample correlation integral is given by

$$\hat C_{m,e}(h) = \binom{T-m+1}{2}^{-1}\sum_{t=m+1}^{T}\sum_{s=m}^{t-1}\prod_{j=0}^{m-1}I(|e_{t-j}-e_{s-j}|<h),$$
Figure 7.3: (a) Estimated correlation integral $\log_{10}\hat C_{m,Y}(h)$; (b) slope estimates $\hat\beta_m$ for a simulated ExpAR(1) process; $T = 2{,}000$.

and where $\hat\sigma^2_{m,e}(h)$ follows from (7.45). Under $H_0$, the test statistic (7.47) is again asymptotically standard normal distributed.
The correlation dimension of $\{e_t\}_{t=1}^T$ is defined as

$$D_m = \lim_{h\to0}\frac{\log C_{m,e}(h)}{\log h}, \qquad (7.48)$$

indicating that $\hat C_{m,e}(h)\propto h^{D_m}$. Notice that the dimensionality of the distribution of $\{Y_t, t\in\mathbb Z\}$ need not be an integer number, which in chaos theory is an indication of a fractal structure. For a given value $m$, the relationship between $\log\hat C_{m,e}(h)$ and $\log h$ can be illustrated as the slope of $\log\hat C_{m,e}(h) = D_m\times\log h$. The slope will converge to a stationary value for increasing lengths $m$ of the delay vector $\mathbf Y_t$ when the dynamic system is deterministic, i.e., when the limit in (7.48) is finite. When the dynamical system is stochastic, the slope continually increases as $m$ increases; the limit in (7.48) is infinite.
Rather than using an estimator of the slope for a single value $h$, Kočenda and Briatka (2005) propose to use an estimator of the average slope across a range of values $h$, which means calculating $\hat\beta_m$ as a consistent estimate of the slope coefficient $\beta_m$ from the LS regression

$$\log\hat C_{m,e}(h_i) = \alpha_m+\beta_m\log h_i+u_i, \qquad (i = 1,\ldots,n), \qquad (7.49)$$

where $\alpha_m$ is an intercept, $u_i$ an error term, and $n$ the number of $h_i$'s taken into consideration. However, these authors ignore the fact that $\hat C_{m,e}(\cdot)$ is an empirical CDF (of distances between pairs of points). A regression ignoring this will be inefficient, as it leads to correlated residuals.
(ii) Compute $\hat C_{m,\tilde e}(h)$.

(iii) Repeat steps (i) – (ii) $B$ times, to obtain $\{\hat C^{(b)}_{m,\tilde e}(h)\}_{b=1}^{B}$.

(iv) Calculate the bootstrap p-value (a code sketch of a permutation variant follows the algorithm)

$$\hat p_{BDS} = \frac{1+\sum_{b=1}^{B}I\big(\hat C^{(b)}_{m,\tilde e}(h)\ge\hat C_{m,e}(h)\big)}{1+B}.$$
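A minimal permutation version of the BDS statistic, in the spirit of the algorithm above, is sketched below in base R. The sup-norm for distances and the two-sided p-value are illustrative choices; no variance estimator (7.45) is needed, since the null distribution is obtained by permuting the residuals.

# Permutation BDS sketch: the correlation integral is recomputed on permuted
# residuals, avoiding the variance formula (7.45); sup-norm is an assumption.
corr_integral <- function(e, m, h) {
  N <- length(e) - m + 1
  Y <- as.matrix(sapply(0:(m - 1), function(j) e[(1 + j):(N + j)]))  # delay vectors
  d <- as.matrix(dist(Y, method = "maximum"))
  mean(d[upper.tri(d)] < h)                       # fraction of close pairs
}
bds_perm <- function(e, m = 2, B = 199, h = sd(e)) {
  stat <- function(x) corr_integral(x, m, h) - corr_integral(x, 1, h)^m
  s0 <- stat(e)
  sb <- replicate(B, stat(sample(e)))             # permuted residuals
  (1 + sum(abs(sb) >= abs(s0))) / (1 + B)         # two-sided p-value
}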
These features make the BDS test statistic a useful diagnostic tool in the context of nonlinear time series analysis. On the other hand, the BDS test statistic suffers from some problems (Brock et al., 1991).
• There is arbitrariness in the choice of h, which may affect both the power and
size of the test. In fact, some choices of h may render the BDS test statistic
inconsistent against certain alternatives. Thus, the probability of rejecting H0
does not always approach 1, as T → ∞. In practice, h is usually taken as a
fraction of the standard deviation of the time series under study.
• Another problem is that the BDS test statistic, though asymptotically normal
under the null hypothesis, has high rates of Type I error, especially for non-
Gaussian data.5
In the next section various extensions of the BDS test statistic are considered
that are freed from some or all of these drawbacks.
$$S_{m,W}(h) = \sqrt T\,\frac{\hat C_{m,W}(h)-\{\hat C_{1,W}(h)\}^m}{\hat\sigma_{m,W}(h)}, \qquad (7.51)$$

where $\hat C_{m,W}(h)$ and $\hat\sigma_{m,W}(h)$ are defined in a similar way as respectively (7.43) and (7.44). In analogy with $S_{m,e}(h)$, it can be shown that the large-sample distribution of $S_{m,W}(h)$ is standard normal under the null hypothesis of no serial dependence.

In a similar fashion, Genest et al. (2007) propose a rank-based analogue of the BDS test statistic. Let $\tilde e_t = \mathrm{rank}(e_t)/(T+1)$ denote the normalized ranks of the time series $\{e_t\}_{t=1}^T$. Write $\tilde{\mathbf W}_t = (\tilde W_{1,t},\ldots,\tilde W_{m,t})' = (\tilde e_t,\ldots,\tilde e_{t-m+1})'$. Then a rank-based version of $S_{m,W}(h)$ may be defined as

$$S_{m,\tilde W}(h) = \sqrt T\,\frac{\hat C_{m,\tilde W}(h)-\{\hat C_{1,\tilde W}(h)\}^m}{\hat\sigma_{m,\tilde W}(h)}. \qquad (7.52)$$

Again, under the $H_0$ of no serial dependence, it follows that $S_{m,\tilde W}(h)\xrightarrow{D}N(0,1)$, $\forall h\in(0,\infty)$, as $T\to\infty$.
5 This problem does not occur with the permutation-based BDS test statistic (Algorithm 7.2), as it has exact size.
Table 7.2: Rank-based BDS test statistics of serial independence using three functionals (direct integration (D), Kolmogorov–Smirnov (KS), and Cramér–von Mises (CvM)), and two empirical processes.

(1) $\mathbb B_T(\mathbf u) = \binom{T}{2}^{-1}\sum_{1\le i<j\le T}\prod_{k=1}^{m}I(|\tilde W_{k,j}-\tilde W_{k,i}|\le u_k)$ with $\mathbf u = (u_1,\ldots,u_m)'\in[0,1]^m$; $\mathbb G_T(h) = \mathbb B_T(h,1,\ldots,1)$ with $h\in(0,1]$.

(2) $\mathbb B^*_T(\mathbf u) = T^{-1}\sum_{i=1}^{T}\prod_{k=1}^{m}\{\tilde F(\tilde w_{k,i}+u_k)-\tilde F(\tilde w_{k,i}-u_k)\}$, where $\tilde F(\cdot)$ is the distribution function of a $U(0,1)$ random variable; $\mathbb G^*_T(\mathbf u) = \prod_{k=1}^{m}G(u_k)$ with $G(\cdot)$ a Beta(1,2) distribution.
Clearly, the finite-sample performances of the test statistics (7.51) and (7.52)
depend on the choice of h. A common way to get around this problem is to integrate
out h with regard to some empirical process using various continuous functionals.
Adopting direct integration (D), the KS and CvM functionals, and two empirical
processes, Genest et al. (2007) propose six rank-based BDS test statistics; see Table
7.2. Moreover, they show that under H0 , all six test statistics converge in distribution
to centered Gaussian variables.
Figure 7.4: S&P 500 daily stock price index for the time period 1992 – 2003 (3,102
observations) with two subperiods, denoted by vertical red medium dashed lines, from Novem-
ber 2000 – February 2003 (T = 608) and March 2003 – December 2003 (T = 218).
Table 7.3: Bootstrap p-values of seven test statistics for serial independence applied to
daily S&P 500 stock returns. Time period November 2000 – February 2003 (T = 608), and
March 2003 – December 2003 (T = 218); B = 1,000. Blue-typed numbers indicate rejection
of H0 at the 5% nominal significance level.
(geometric) random walk, possibly with drift. We consider two sample subperiods. The first one (11/2000 – 02/2003; $T = 608$) corresponds to the worst decline in the S&P 500 index since 1931, with the end of the "dot-com bubble" around November 2000. The second time period (03/2003 – 12/2003; $T = 218$) corresponds to an upward trend with moderate volatility, indicating the start of a new bull market in the first quarter of 2003. Using the circular version of the BDS test statistic, we test for serial independence in the series of daily stock returns, $R_t = \log(P_t/P_{t-1})$, with $h = \hat\sigma_R$, i.e. the standard deviation of $\{R_t\}_{t=1}^T$. In addition, using the six rank-based test statistics, we investigate the normalized ranks $\tilde R_t = \mathrm{rank}(R_t)/(T+1)$.
Table 7.3 reports bootstrapped p-values, based on $B = 1{,}000$ bootstrap replicates, for each of the seven test statistics. Note that for the first, downward, period the results of almost all test statistics suggest that the underlying DGP is not i.i.d. On the other hand, the p-values of the circular BDS test statistic $S_{m,R}$, and the rank-based test statistics $I_{m,\tilde R}$ and $M_{m,\tilde R}$, are insignificant at the 5% nominal level for all values of $m$. The second, upward, period shows a very different picture. There, except for the test statistic $T_{m,\tilde R}$, almost all test results suggest that the process $\{R_t, t\in\mathbb Z\}$ is i.i.d., i.e., the S&P 500 daily stock price index follows a random walk.
$$\mathbb H_{m,T}(\mathbf y) = \sqrt T\Big\{\hat F_{m,T}(\mathbf y)-\prod_{i=1}^{m}\hat F(y_i)\Big\}, \qquad \mathbf y\in\mathbb R^m, \qquad (7.53)$$

where

$$\hat F_{m,T}(\mathbf y) = \frac{1}{T}\sum_{t=1}^{T-m+1}\prod_{i=1}^{m}I(Y_{t+i-1}\le y_i), \quad\text{and}\quad \hat F(y_i) = \frac{1}{T}\sum_{t=1}^{T-m+1}I(Y_{t+i-1}\le y_i), \quad (i = 1,\ldots,m).$$
Various functionals of (7.53) can be used for testing the null hypothesis (7.4). Delgado (1996) proposes the CvM functional. When $m = 2$, the resulting test statistic $\hat\Delta^D_{m,T}$ (see Table 7.4) has the same asymptotic null distribution as the test statistic of Blum et al. (1961) in the bivariate case. However, for $m>2$, the asymptotic covariance function of $\hat\Delta^D_{m,T}$ is not convenient for the tabulation of critical values, due to the complex nature of the limiting distribution of $\mathbb H_{m,T}(\cdot)$.
High-dimensional test statistics leading to considerably simpler asymptotic covariances under the null hypothesis than $B^{CvM}_{m,T}$ can be based on the Möbius transformation (Rota, 1964), or decomposition, of the process $\mathbb H_{m,T}(\cdot)$. Consider an index set $\mathcal S_m = \{A\subseteq\{1,\ldots,m\};\ |A|>1\}$, where $|A|$ is the cardinality of the index set $A$. Since $|A|\le m$, $\mathcal S_m$ contains $2^m-m-1$ elements. Now, the Möbius transformation $\mathcal M$ decomposes $\mathbb H_{m,T}(\cdot)$ into $2^m-m-1$ sub-processes $\mathbb G_{A,T} = \mathcal M_A(\mathbb H_{m,T})$, namely

$$\mathbb G_{A,T}(\mathbf y) = \sum_{B\subseteq A}(-1)^{|A\setminus B|}\mathbb H_{m,T}(\mathbf y)\prod_{i\in A\setminus B}\hat F(y_i) = \frac{1}{\sqrt T}\sum_{t=1}^{T-m+1}\prod_{i\in A}\big\{I(Y_{t+i-1}\le y_i)-\hat F(y_i)\big\}, \qquad \mathbf y\in\mathbb R^m, \qquad (7.54)$$

where $\prod_{i\in\emptyset} = 1$ by convention. In this case, the characterization of serial independence of $(Y_{1,t},\ldots,Y_{m,t})'$ is equivalent to having $\mathcal M_A(\cdot)\equiv0$, for all $A\subseteq\{1,\ldots,m\}$.

It follows from standard theory (see, e.g., Shorack and Wellner, 1984) that under the null hypothesis of (serial) independence, $\mathbb G_{A,T}(\cdot)$ converges weakly to a continuous centered Gaussian process with covariance function

$$\mathrm{Cov}_A(\mathbf x,\mathbf y) = \prod_{i\in A}\Big[\min\{F(x_i),F(y_i)\}-F(x_i)F(y_i)\Big], \qquad \mathbf x,\mathbf y\in\mathbb R^m,$$
When $m = 2$, (7.55) simplifies to the single test statistic $M^{CvM}_{\{1,2\},T}$ which, interestingly, coincides with the test statistic $\hat\Delta^{ST_3}_T(\ell)$ at lag $\ell = 1$. Thus, a Möbius transformation is not needed in this particular case. Under the null hypothesis of (serial) independence, the limiting distribution of $M^{CvM}_{A,T}$ is given by

$$\sum_{(i_1,\ldots,i_{|A|})\in\mathbb N^{|A|}}\lambda_{(i_1,\ldots,i_{|A|})}Z^2_{(i_1,\ldots,i_{|A|})},$$

where the $Z_{(i_1,\ldots,i_{|A|})}$'s are independent $N(0,1)$ random variables; Deheuvels (1981). Observe that the sets $A$ contribute differently to each of the test statistics $M^{CvM}_{A,T}$, with the biggest contribution coming from small-sized sets. To avoid this problem, it is convenient to standardize $M^{CvM}_{A,T}$ by the asymptotic mean and variance of $\xi_{|A|}$, which are, respectively, given by $E(\xi_{|A|}) = 1/6^{|A|}$ and $\mathrm{Var}(\xi_{|A|}) = 2/90^{|A|}$. The lower part of Table 7.4 displays the two resulting test statistics, denoted by the short-hand notation $GKR_1$ and $GKR_2$.

An obvious limitation of tests based on the above approach is the dependence of the asymptotic null distribution of the $\mathbb G_{A,T}(\cdot)$'s on the marginals of $\mathbb H_{m,T}(\cdot)$. To alleviate this problem, the original observations are replaced by their associated ranks in Section 7.4.4.
$$\mathbb C_T(\mathbf u) = \frac{1}{\sqrt T}\sum_{t=1}^{T-m+1}\Big[\prod_{i=1}^{m}I\{R_{t+i-1}\le(T+1)u_i\}-\prod_{i=1}^{m}u_i\Big], \qquad \mathbf u\in[0,1]^m, \qquad (7.56)$$

where $\{R_t\}_{t=1}^T$ are the ranks of $\{Y_t\}_{t=1}^T$. Using the Möbius transformation, Genest and Rémillard (2004) define the $2^m-m-1$ stochastic processes

$$\mathbb G^c_{A,T}(\mathbf u) = \frac{1}{\sqrt T}\sum_{t=1}^{T-m+1}\prod_{i\in A}\big[I\{R_{t+i-1}\le(T+1)u_i\}-U_T(u_i)\big], \qquad (7.57)$$
Table 7.4: Test statistics for high-dimensional serial independence.

Delgado (1996):

$$\hat\Delta^D_{m,T} = \sum_{t=1}^{T-m+1}\big\{\mathbb H_{m,T}(\mathbf Y_t)\big\}^2,$$

where $\mathbb H_{m,T}(\mathbf y) = \dfrac{1}{T}\displaystyle\sum_{t=1}^{T-m+1}\prod_{i=1}^{m}I(Y_{t+i-1}\le y_i)-\prod_{i=1}^{m}\Big\{\dfrac{1}{T}\displaystyle\sum_{t=1}^{T-m+1}I(Y_{t+i-1}\le y_i)\Big\}$.

Ghoudi et al. (2001):

$$\hat\Delta^{GKR_1}_{m,T} = \sum_{A}\big\{M^{CvM}_{A,T}-(1/6^{|A|})\big\}\big/\sqrt{2/90^{|A|}}, \qquad \hat\Delta^{GKR_2}_{m,T} = \max_{A}\Big|\big\{M^{CvM}_{A,T}-(1/6^{|A|})\big\}\big/\sqrt{2/90^{|A|}}\Big|,$$

where $M^{CvM}_{A,T} = \int\{\mathbb G_{A,T}(\mathbf y)\}^2\,d\hat F_{m,T}(\mathbf y)$, with $\mathbb G_{A,T}(\mathbf y) = \dfrac{1}{\sqrt T}\displaystyle\sum_{t=1}^{T-m+1}\prod_{i\in A}\big\{I(Y_{t+i-1}\le y_i)-\hat F(y_i)\big\}$.
Some algebra shows that (7.58) can be computed directly from the ranks as

$$M^{CvM,c}_{A,T} = \frac{1}{T}\sum_{t=1}^{T-m+1}\sum_{s=1}^{T-m+1}\prod_{i\in A}\Big[\frac{2T+1}{6T}+\frac{R_{t+i-1}(R_{t+i-1}-1)}{2T(T+1)}+\frac{R_{s+i-1}(R_{s+i-1}-1)}{2T(T+1)}-\frac{R_{t+i-1}\vee R_{s+i-1}}{T+1}\Big]. \qquad (7.59)$$

Since the subset $A$ and its $\delta$-translate, say $A+\delta$, generate basically the same process, computation of the test statistic (7.59) can be restricted to subsets $A\in\mathcal A_m = \{A\subset I_m;\ 1\in A, |A|>1\}$ with cardinality $2^{m-1}-1$. The limiting distribution of $M^{CvM,c}_{A,T}$ is the same as that of $M^{CvM}_{A,T}$; a direct implementation of (7.59) is sketched below.
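A direct $O(T^2)$ implementation of (7.59) in base R reads as follows; the function name and the example call are assumptions.

# Rank-based CvM statistic (7.59) for a subset A of {1, ..., m}.
M_cvm_c <- function(y, m, A) {
  T <- length(y); R <- rank(y); n <- T - m + 1
  stat <- 0
  for (t in 1:n) for (s in 1:n) {
    term <- 1
    for (i in A) {
      Rt <- R[t + i - 1]; Rs <- R[s + i - 1]
      term <- term * ((2 * T + 1) / (6 * T) +
                      Rt * (Rt - 1) / (2 * T * (T + 1)) +
                      Rs * (Rs - 1) / (2 * T * (T + 1)) -
                      max(Rt, Rs) / (T + 1))
    }
    stat <- stat + term
  }
  stat / T
}
# M_cvm_c(rnorm(100), m = 2, A = c(1, 2))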
Multivariate

Kojadinovic and Yan (2011) address the generalization of the univariate serial copula test to the case of continuous multivariate time series. Consider a strictly stationary ergodic sequence of $q$-dimensional random vectors $\mathbf Y_1,\mathbf Y_2,\ldots$, where the common distribution function of each $\mathbf Y_t = (Y_{1,t},\ldots,Y_{q,t})'$ is denoted by $F(\cdot)$ and the associated copula by $C(\cdot)$. Furthermore, let $m>1$ be an integer, let $\tilde T = T+m-1$, and, for any $i\in\{1,\ldots,q\}$, let $R_{i,1},\ldots,R_{i,\tilde T}$ be the ranks associated with the univariate sequence $\{Y_{i,t}\}_{t=1}^{\tilde T}$. The ranks are related to the univariate empirical marginal distribution function $F_{i,\tilde T}(Y_{i,t})$ through the equalities $R_{i,t} = \tilde T F_{i,\tilde T}(Y_{i,t})$ $(t = 1,\ldots,\tilde T;\ i = 1,\ldots,q)$.
To build an empirical copula in the multivariate case, we need to introduce
some notation. First, given the index set B ⊆ {1, . . . , m}, we define the vector
uB ∈ [0, 1]mq by
\[
u_B^{(j)} =
\begin{cases}
u^{(j)} & \text{if } j \in \bigcup_{i\in B}\{(i-1)q+1, \ldots, iq\},\\[2pt]
1 & \text{otherwise.}
\end{cases}
\]
Next, given u ∈ [0, 1]mq and i ∈ {1, . . . , m}, define the sub-vector ui ∈ [0, 1]q of u
by
\[
u_i^{(j)} = u^{(j+(i-1)q)}, \quad (i = 1, \ldots, m;\; j = 1, \ldots, q).
\]
The empirical serial copula is then given by
\[
C^{s}_T(u) = \frac{1}{T}\sum_{t=1}^{T}\prod_{i=1}^{m}\prod_{j=1}^{q} I\bigl\{F_{j,\tilde T}(Y_{j,t+i-1}) \le u_i^{(j)}\bigr\}
= \frac{1}{T}\sum_{t=1}^{T}\prod_{i=1}^{m}\prod_{j=1}^{q} I\bigl\{R_{j,t+i-1} \le \tilde T u_i^{(j)}\bigr\}.
\]
As noticed by Ghoudi et al. (2001) in the univariate case, it follows from the Möbius decomposition (transformation) of C^s_T(·) that the limiting distributions of the processes \sqrt{T}\,M_A(C^s_T) and \sqrt{T}\,M_{A+\delta}(C^s_T) are roughly the same. Hence, attention can be restricted to the 2^{m−1} − 1 processes \sqrt{T}\,M_A(C^s_T) for A ∈ \mathcal{A}_m. Then, after some tedious algebra, the resulting CvM test statistics are given by
\[
M_{A,q,T} = \frac{1}{T^2}\sum_{r=1}^{T}\sum_{s=1}^{T}\prod_{i\in A}\prod_{k=1}^{q}\Bigl[\,\cdots\, + 1 - \frac{R_{k,r+i-1} \vee R_{k,s+i-1}}{\tilde T}\Bigr]. \tag{7.61}
\]
The factor 1/2 ensures that the p-values are in the open interval (0, 1)
so that transformations by inverse CDFs of continuous distributions
are always well-defined.
• Combined p-values: For all i ∈ {0, 1, . . . , B}, compute
\[
F_T^{(i)} = -2\sum_{A\in\mathcal{A}_m}\log p\bigl(M^{(i)}_{A,q,T}\bigr)
\quad\text{and}\quad
T_T^{(i)} = \min_{A\in\mathcal{A}_m}\log p\bigl(M^{(i)}_{A,q,T}\bigr).
\]
Then the combined p-values à la Fisher and à la Tippett are given by
\[
\hat{p}_F = \frac{1}{B}\sum_{b=1}^{B} I\bigl(F_T^{(b)} \ge F_T^{(0)}\bigr), \quad\text{and}\quad
\hat{p}_T = \frac{1}{B}\sum_{b=1}^{B} I\bigl(T_T^{(b)} \ge T_T^{(0)}\bigr).
\]
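As an illustration, the Fisher-type combination can be coded in a few lines of Python. The array layout (row 0 holding the observed statistics, rows 1, . . . , B their bootstrap replicates) is an assumption made for this sketch.

import numpy as np

def fisher_combined_pvalue(stats):
    """Fisher-type combination of per-subset bootstrap p-values.
    `stats` is a (B+1) x K array: row 0 holds the K observed statistics
    (one per subset A in A_m); rows 1..B hold bootstrap replicates."""
    B = stats.shape[0] - 1
    boot = stats[1:]
    # p(M^{(i)}): proportion of replicates >= M^{(i)}, shifted by the
    # factor 1/2 so that every p-value lies strictly inside (0, 1)
    p = np.array([(np.sum(boot >= stats[i], axis=0) + 0.5) / (B + 1)
                  for i in range(B + 1)])
    F = -2.0 * np.log(p).sum(axis=1)     # F_T^{(i)}, i = 0, 1, ..., B
    return np.mean(F[1:] >= F[0])        # approximate global p-value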
Figure 7.5: Dependogram summarizing the results of the multivariate test of serial inde-
pendence for the climate change data set; q = 2, m = 5. A red star denotes the approximate
critical value.
U-statistic estimators of μ^{(1)}_m and μ^{(2)}_m:
\[
\hat{\mu}^{(1)}_m = \binom{T-m+1}{2}^{-1}\sum_{i=2}^{T-m+1}\sum_{s=1}^{i-1}\prod_{j=0}^{m-1} K_h(Y_{i+j}, Y_{s+j}),
\]
\[
\hat{\mu}^{(2)}_m = \frac{1}{(T-m+1)^m}\prod_{j=0}^{m-1}\Bigl[\sum_{t=1}^{T-m+1}\hat{C}_{h,T}(Y_{t+j})\Bigr],
\]
where
\[
\hat{C}_{h,T}(y) = \frac{1}{T-m+1}\sum_{i=1}^{T-m+1} K_h(y, Y_i).
\]
Figure 7.6: Upper panel: yearly Canadian lynx data for the time period 1821–1934 (blue solid line), and yearly Canadian snowshoe hare data (in thousands) for the time period 1905–1934 (red solid line). Lower panel: (a) the sample ACF for the complete lynx series, and (b) the sample cross-correlation function (CCF) between the lynx series and the snowshoe hare series for the time period 1905–1934. Both plots contain 95% asymptotic confidence limits (blue medium dashed lines).
periodic fluctuations with sharp and large peaks and relatively small troughs. As shown in Figure 7.6(a), the pattern of the sample ACF of the data indicates a cyclical behavior of about ten years (a 9.61-year periodicity). The data set is assumed to represent the relative magnitude of the lynx population and, hence, is of great interest to ecological researchers. To understand the cyclical behavior in the Canadian lynx series, the upper panel of Figure 7.6 also shows 30 yearly observations of the Canadian snowshoe hare series for the time period 1905–1934. Snowshoe hares (prey) constitute a major part of the lynx's (predator) diet. Note that the hare series lags behind the lynx series. Indeed, as can be seen from the sample CCF in Figure 7.6(b), there is a significant relationship between the two series, but the lynx–hare interaction is not instantaneous; rather, there is a time delay of about two years.
According to McCarthy (2005), a possible cause of the cyclical fluctuations is that hare populations increase and eat vegetation. In response, the vegetation produces secondary defence compounds which are less palatable and nutritious. This triggers a crash of the hare population – hares die in great numbers. The lynx continue to feed on hares, but eventually run out of prey. This is followed by a decline in the lynx population. Next, the vegetation slowly recovers and this rejuvenates
Table 7.5: Five models fitted to the Canadian lynx data set; T = 114. The last column gives the (pooled) residual variance \hat{\sigma}^2_\varepsilon.

Moran (1953):
\[
Y_t = 1.0549 + 1.4101Y_{t-1} - 0.7734Y_{t-2} + \varepsilon_t \qquad (0.0459)
\]

Tong (1990, p. 387):
\[
Y_t =
\begin{cases}
0.546 + 1.032Y_{t-1} - 0.173Y_{t-2} + 0.171Y_{t-3} - 0.431Y_{t-4} \\
\quad + 0.332Y_{t-5} - 0.284Y_{t-6} + 0.210Y_{t-7} + \varepsilon_t^{(1)}, & Y_{t-2} \le 3.116,\\
2.632 + 1.492Y_{t-1} - 1.324Y_{t-2} + \varepsilon_t^{(2)}, & Y_{t-2} > 3.116
\end{cases}
\qquad (0.0358^{(1)})
\]

Tsay (1989):
\[
Y_t =
\begin{cases}
0.083 + 1.096Y_{t-1} + \varepsilon_t^{(1)}, & Y_{t-2} \le 2.373,\\
0.63 + 0.96Y_{t-1} - 0.11Y_{t-2} + 0.23Y_{t-3} - 0.61Y_{t-4} \\
\quad + 0.48Y_{t-5} - 0.39Y_{t-6} + 0.28Y_{t-7} + \varepsilon_t^{(2)}, & 2.373 < Y_{t-2} \le 3.154,\\
2.323 + 1.530Y_{t-1} - 1.266Y_{t-2} + \varepsilon_t^{(3)}, & 3.154 < Y_{t-2}
\end{cases}
\qquad (0.0348^{(2)})
\]

Ozaki (1982)^{(3)}:
\[
Y_t = [1.167 + (0.316 + 0.982Y_{t-1})\exp(-3.89Y_{t-1}^2)]Y_{t-1} - [0.437 + (0.659 + 1.26Y_{t-1})\exp(-3.89Y_{t-1}^2)]Y_{t-2} + \varepsilon_t \qquad (0.0433)
\]

Teräsvirta (1994):
\[
Y_t = 1.17Y_{t-1} + (-0.92Y_{t-2} + 1.00Y_{t-3} - 0.41Y_{t-4} + 0.27Y_{t-9} - 0.21Y_{t-11}) \times [1 + \exp\{-1.73 \times 1.8(Y_{t-3} - 2.73)\}]^{-1} + \varepsilon_t \qquad (0.0350)
\]

^{(1)} Var(ε_t^{(1)}) = 0.0259 and Var(ε_t^{(2)}) = 0.0505.
^{(2)} Var(ε_t^{(1)}) = 0.015, Var(ε_t^{(2)}) = 0.025, and Var(ε_t^{(3)}) = 0.053.
^{(3)} As suggested by Tong (1990), the parameter 1.167 in the ExpAR(2) model replaces the original parameter 0.138 given by Ozaki.
Table 7.6: Bootstrap p-values of eight test statistics for high-dimensional serial independ-
ence applied to the residuals of five time series models fitted to the log of the Canadian lynx
time series (see Table 7.5); T = 114, B = 1,000. Blue-typed numbers indicate rejection of
H0 at the 5% nominal significance level.
Not surprisingly, the lack of fit of Moran's AR(2) model and Ozaki's ExpAR(2) model has been noted by other researchers. However, the fact that the residuals of the LSTAR(11) model do not pass all test statistics is new. It suggests that the model may be further improved. Finally, note that for the AR(2) model no evidence of residual dependence is detected by I^*_{m,T} when m = 2, while for m = 4 and m = 6 the p-value of this test statistic is smaller than the 5% nominal significance level. Thus, it is recommended not to rely completely on low-dimensional test results.
have reasonable size and power properties compared with many nonlinear alternatives. We should emphasize, however, that adopting the limiting null distribution of a test statistic can be hazardous, except for very large sample sizes T. When using random permutation or bootstrapping approaches, the size of a test statistic is often much closer to its nominal significance level for T < 500.
On the other hand, it is now generally believed that many empirical time series, while nonlinear, are generated by high-dimensional processes. Hence, it is natural to consider test statistics designed for this purpose. In this case, several of the rank-based extensions of the BDS test statistic discussed in Section 7.4.2, and the copula-based test statistics of Section 7.4.4, are useful. In particular, these test statistics are more powerful than their single-lag and multiple-lag counterparts, with T_{m,T} as the best performing rank-based BDS test.
The asymptotic properties of nonparametric estimators of copulas for time series processes
are considered by Fermanian and Scaillet (2003), and Ibragimov (2009), among others.
Section 7.4: Matilla-García and Ruiz-Marín (2008) propose a test statistic for high-dimensional serial independence using symbolic dynamics and permutation entropy. The test requires unrealistically large sample sizes for dimensions m ≥ 6. De Gooijer and Yuan (2016) explore a link between the correlation integral and the Shannon entropy, or second order Rényi entropy, to derive two nonparametric portmanteau-type test statistics for serial independence. In commonly used sample sizes, both tests performed similarly to the best performing rank-based BDS test statistics of Section 7.4.2.
Baek and Brock (1992a) extend the BDS test statistic to vector time series. Wolff and
Robinson (1994) observe that the estimator of the unnormalized correlation integral has
a limiting Poisson distribution under some moderate assumptions regarding the marginal
distribution. This motivated a nonparametric test procedure with slightly reduced size
distortion compared with the BDS test statistic. de Lima (1996) formulates five conditions
under which the BDS test statistic is asymptotically nuisance-parameter-free.
Within the context of independent component analysis, a concept that is important in signal
processing and neural networks, a subsampling pairwise test statistic for serial independ-
ence has been suggested by Karvanen (2005), based on the test of total independence by
Kankainen and Ushakov (1998). Related to this is the paper by Wu et al. (2009). They pro-
pose a smoothed bootstrap-based test statistic for high-dimensional serial independence in
multivariate time series data by combining pairwise independence tests for all pairs. Other
recently proposed test statistics suitable for both time-independent and time-dependent com-
ponent analysis have been derived by, among others, Achard (2008), Baringhaus and Franz
(2004), Fernández et al. (2008), Székely et al. (2007) (see the R-energy package), Gretton et
al. (2005), and Zhou (2012).
Evidently, many density-based serial correlation tests require that the data come from a continuous population. Although they will no longer be distribution-free, some of the discussed test statistics can also be used in the discrete case. For instance, the Skaug–Tjøstheim (1993b) test statistic \hat{\Delta}^{ST_1}_T can be applied to continuous as well as to discrete (or discretized) data, after some slight adjustment of the form of the test. For a stationary sequence of a categorical variable, high-dimensional serial independence can be checked via a test statistic developed by Bilodeau and Lafaye de Micheaux (2009).
The so-called k-nearest neighbor density estimator avoids the problem of a pre-defined grid
required to compute the multi-dimensional copula-based histogram estimator discussed in
Section 7.3.2; see Blumentritt and Schmid (2012). Alternatively, for estimating the copula
density, a nonparametric method proposed by Kallenberg (2009) may be adopted.
Exercise 7.7: Various MAR models are available in the literature. Le et al. (1996), and
Wong and Li (2000b, 2001) assume that the mixing proportions are time invariant. More
general (Gaussian) MAR and MAR–GARCH models follow by assuming that the mixing
proportions are functions of observed variables; see, e.g., Lanne and Saikkonen (2003), and
Kalliovirta et al. (2015) and the references therein. Sufficient conditions for strict and second
order stationarity are given by, among others, Zeevi et al. (2000), Wong and Li (2000b), and
Saikkonen (2008).
7.8 Data and Software References
Section 7.4.5: The C++ code of the \hat{\Delta}^{DP}_{m,T} test statistic (7.62) can be downloaded from Cees Diks' web page located at http://cendef.uva.nl/people.
Appendix
\[
\hat{f}_{h_n}(x) = \frac{F_n(x + h_n) - F_n(x - h_n)}{2h_n}
= \frac{1}{2nh_n}\sum_{i=1}^{n} I(x - h_n \le X_i \le x + h_n)
= \frac{1}{2nh_n}\sum_{i=1}^{n} I\Bigl(\frac{|X_i - x|}{h_n} \le 1\Bigr). \tag{A.1}
\]
Clearly, \hat{f}_{h_n}(·) counts the proportion of observations falling in the neighborhood of x. The parameter h_n (the bandwidth) controls the degree of smoothing: the greater h_n, the greater the smoothing.
Equation (A.1) is a special case of what is called a kernel density estimator with weight function, or kernel, K(·) = \tfrac{1}{2}I(|·| ≤ 1). The general, basic, kernel estimator may be written compactly as
\[
\hat{f}_{h_n}(x) = \frac{1}{nh_n}\sum_{i=1}^{n} K\Bigl(\frac{x - X_i}{h_n}\Bigr) = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i), \tag{A.2}
\]
where K_{h_n}(·) = K(·/h_n)/h_n. Here, K(·) is a so-called kernel function.
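The estimator (A.2) is only a few lines of code. The following Python sketch uses the uniform kernel K(u) = \tfrac{1}{2}I(|u| ≤ 1) of (A.1) as its default; the function name and grid are illustrative choices.

import numpy as np

def kernel_density(x_grid, data, h, kernel=lambda u: 0.5 * (np.abs(u) <= 1)):
    """Kernel density estimate (A.2) evaluated on a grid of points."""
    u = (x_grid[:, None] - data[None, :]) / h   # scaled distances (grid x n)
    return kernel(u).mean(axis=1) / h           # (1/(n h)) sum_i K((x - X_i)/h)

# Example: estimate a standard normal density from n = 500 draws
rng = np.random.default_rng(1)
data = rng.standard_normal(500)
grid = np.linspace(-3.0, 3.0, 61)
f_hat = kernel_density(grid, data, h=0.4)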
Kernel functions
A kernel function K : \mathbb{R} \to \mathbb{R} is any function for which \int_{\mathbb{R}} K(u)\,du = 1. A non-negative kernel satisfies K(u) ≥ 0 ∀u, which ensures that K(·) is a pdf. A symmetric kernel satisfies K(u) = K(−u) ∀u. In this case all odd moments of a kernel are zero, where the moments of K(·) are defined by
\[
\mu_j(K) = \int_{\mathbb{R}} u^j K(u)\,du.
\]
The use of symmetric and unimodal kernels is standard in nonparametric estimation, and will henceforth be adopted. The order of a kernel, say ν, is defined as the first non-zero moment, i.e. μ_0(K) = 1 and μ_j(K) = 0 (j = 1, . . . , ν − 1), but μ_ν(K) ≠ 0. Some common second-order kernel functions are listed in Table 7.7 and exhibited in Figure 7.7; all are supported on the interval [−1, 1] except for the Gaussian kernel, which has infinite support. The first four second-order kernels are special cases of the polynomial family
\[
K_{[2],p}(u) = \frac{(2p+1)!!}{2^{p+1}\,p!}\,(1-u^2)^p\, I(|u| \le 1), \quad (p = 0, 1, 2, 3).
\]
The Gaussian kernel follows by taking the limit p → ∞ after re-scaling. Higher-order kernels are smoother, reducing the order of the bias of the curve estimator provided large sample sizes (n ≳ 1,000) are available. The basic shape of the kernels is similar. Since, however, higher-order kernel functions take on negative values, the resultant estimate of f(·) can also have negative values.
\[
\mathrm{MSE}\bigl(\hat{f}_{h_n}(x)\bigr) = E\bigl[\hat{f}_{h_n}(x) - f(x)\bigr]^2 = \mathrm{Bias}^2\bigl(\hat{f}_{h_n}(x)\bigr) + \mathrm{Var}\bigl(\hat{f}_{h_n}(x)\bigr). \tag{A.3}
\]
Since we can reverse the order of integration (over the support of X and over the probability space of X), we have \mathrm{MISE}(\hat{f}_{h_n}) = \int_{\mathbb{R}} \mathrm{MSE}\bigl(\hat{f}_{h_n}(x)\bigr)\,dx, so that the MISE equals the integrated MSE, a measure which does not depend upon the data.
Ideally, we want to pick a bandwidth value h_n that minimizes the MISE. However, the optimal bandwidth that minimizes the MISE depends on the unknown pdf f(·). In order to make progress under this distance measure, it is usual to employ asymptotic approximations to the bias and variance of the kernel density estimator. The result is called asymptotic MISE (AMISE), i.e., \mathrm{AMISE}(\hat{f}_{h_n}) = \int_{\mathbb{R}} \mathrm{AMSE}\bigl(\hat{f}_{h_n}(x)\bigr)\,dx, with AMSE the asymptotic MSE of \hat{f}_{h_n}(·). The optimal bandwidth, say h_{opt}, is the one that minimizes the AMISE(·), giving rise to AMISE_{opt}(·).
Now, given that we have selected the kernel order ν, which kernel should we use? It
is straightforward to verify (cf. Exercise 7.7) that the kernel’s contribution to the optimal
AMISE is the following dimensionless factor:
\[
\mathrm{AMISE}_{opt}(K) \propto \bigl[\mu_\nu^2(K)\, R(K)^{2\nu}\bigr]^{1/(2\nu+1)}, \tag{A.4}
\]
where R(g) = \int_{\mathbb{R}} g^2(z)\,dz is the roughness penalty of the function g(·) (column three of Table 7.7). Then, to compare kernels, the efficiency (eff) of kernel K(·) relative to kernel K^*(·) is defined as
\[
\mathrm{eff}(K) = \Bigl(\frac{\mathrm{AMISE}_{opt}(K)}{\mathrm{AMISE}_{opt}(K^*)}\Bigr)^{(2\nu+1)/2\nu} = \Bigl(\frac{\mu_\nu^2(K)}{\mu_\nu^2(K^*)}\Bigr)^{1/2\nu} \frac{R(K)}{R(K^*)}. \tag{A.5}
\]
Bandwidth selection
For practical problems the choice of the kernel is not so critical compared to the choice of the bandwidth. The bandwidth depends on the sample size n and has to fulfill h_n → 0 and nh_n → ∞ as n → ∞ as a necessary condition for consistency of the density estimator. Clearly, this result is not very helpful for finite-sample application. Rather, we may use the AMISE-optimal bandwidth with R(f^{(\nu)}(·)) replaced by R(g^{(\nu)}_{\sigma_X}(·)), where g_{\sigma_X}(·) is a plausible reference density, σ_X is the sample standard deviation, and f^{(ν)}(·) is the νth derivative of f(·).
The resulting rule-of-thumb (rot) bandwidth is
\[
h_{rot} = \hat{\sigma}_X\, C_\nu(K)\, n^{-1/(2\nu+1)}, \tag{A.7}
\]
where
\[
C_\nu(K) = 2\Bigl(\frac{\sqrt{\pi}\,(\nu!)^3 R(K)}{2\nu(2\nu)!\,\mu_\nu^2(K)}\Bigr)^{1/(2\nu+1)}.
\]
The last column of Table 7.7 shows values of C_ν(·) when ν = 2. If a Gaussian second-order kernel is used, (A.7) is often simplified to h_{rot} = \hat{\sigma}_X n^{-1/5}. Rule-of-thumb bandwidths are sensitive to outliers. A robust version of the rule-of-thumb bandwidth rule is h_{rot} = \min\{\hat{\sigma}_X, \mathrm{IQR}_X/1.34\}\, n^{-1/5}, where IQR_X is the interquartile range computed from the sample distribution of X.
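In code, the simplified and the robust rule-of-thumb bandwidths read as follows; this is a sketch for the Gaussian second-order kernel case only.

import numpy as np

def h_rot(x, robust=True):
    """Rule-of-thumb bandwidth sigma_X * n^(-1/5); the robust variant
    replaces sigma_X by min(sigma_X, IQR_X / 1.34)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    scale = x.std(ddof=1)
    if robust:
        iqr = np.subtract(*np.percentile(x, [75, 25]))
        scale = min(scale, iqr / 1.34)
    return scale * n ** (-0.2)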
Rule-of-thumb bandwidths are “pilot” bandwidths, i.e. they are a useful starting point. A more flexible way of obtaining bandwidths is to use a so-called plug-in bandwidth procedure. This method is based on considering some type of quadratic error between the true function and its estimator. Minimizing an asymptotic approximation of the resulting error and replacing the unknown parameters by estimates gives the optimal (plug-in) bandwidth. Plug-in methods have been extensively studied for nonparametric univariate density estimation, but for multivariate data the choice of a method is less clear. A flexible and generally applicable alternative is CV.
\[
\hat{f}_H(x) = \frac{1}{n|H|}\sum_{i=1}^{n} K\bigl(H^{-1}(x - X_i)\bigr) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - X_i), \tag{A.8}
\]
where K_H(·) = K(H^{-1}\,·)/|H|. Further, let
\[
\nabla^\nu f(x) = \sum_{j=1}^{p} \frac{\partial^\nu}{\partial x_j^\nu} f(x).
\]
When the observed data set is from a multivariate normal density φ, an explicit expression for R(∇^ν φ) can be calculated straightforwardly. By replacing R(∇^ν f) by R(∇^ν φ) in (A.9), we obtain the rot-bandwidth
\[
h_{j,rot} = \sigma_j\, C_{\nu,p}(K)\, n^{-1/(2\nu+p)},
\]
where
\[
C_{\nu,p}(K) = \Bigl(\frac{\pi^{p/2}\, 2^{p+\nu-1}\,(\nu!)^2 R(K)^p}{\nu\bigl[(2\nu-1)!! + (p-1)((\nu-1)!!)^2\bigr]\mu_\nu^2(K)}\Bigr)^{1/(2\nu+p)},
\]
and with σ_j the standard deviation of the jth variable, which can be replaced by its sample estimator in practical applications. The constant C_{ν,p}(·) is exactly 1 in the bivariate case (p = 2) with a second-order Gaussian kernel. Numerical values of C_{ν,p}(·) for other combinations of kernel functions, p, and ν can be obtained directly using the results for R(·) and μ_ν(·) given in Table 7.7.
Note from (A.8) that, unless Xi is distributed more or less uniformly in the p-dimensional
space, there is the risk that for a given bandwidth, no data lies in the neighborhood specified
by H. This problem becomes worse as p increases, and is known as the “curse of dimen-
sionality”. Hence, in practice, multivariate kernel density estimation is often restricted to
dimension p = 2.
Nadaraya–Watson estimator
Let \{(X_i, Y_i)\}_{i=1}^{n} represent n independent observations of the random pair (X, Y), where X_i = (X_{1,i}, \ldots, X_{p,i})' is a p-variate random variable. To keep things simple, we assume that the data are generated by the process
\[
Y_i = \mu(X_i) + \varepsilon_i, \tag{A.11}
\]
where \{\varepsilon_i\} is a sequence of i.i.d. zero-mean and finite-variance random variables such that ε_i is independent of X_i, and μ : \mathbb{R}^p \to \mathbb{R} is an “arbitrary” function, called the nonparametric regression function, which satisfies μ(x) = E(Y|X = x) (x ∈ \mathbb{R}^p).
We wish to estimate μ(·). If μ(·) is a smooth function at point x = (x1 , . . . , xp ) , re-
sponses corresponding to Xi ’s near x should contain some information about the value of
μ(·). Therefore, local averaging of the responses about X = x may yield a meaningful es-
timate of μ(·). One particular formulation, called Nadaraya–Watson (NW) kernel estimator
and attributed to Nadaraya (1964) and Watson (1964), uses a kernel function to vary the
weights given to the responses. In particular, a kernel estimate of μ(·) is a weighted average
of observations in the neighborhood of x, and is defined as
\[
\hat{\mu}^{NW}_H(x) = \frac{\sum_{i=1}^{n} K_H(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_H(x - X_i)} = \sum_{i=1}^{n} W_i(x)\, Y_i, \tag{A.12}
\]
with the weights W_i(x) = K_H(x - X_i)\big/\sum_{j=1}^{n} K_H(x - X_j) summing up to one, and where H is a p × p symmetric positive definite matrix of bandwidths.
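A direct implementation of (A.12) for the univariate case (p = 1, scalar bandwidth, Gaussian kernel) is sketched below; since the kernel's normalizing constant cancels in the ratio, it is omitted.

import numpy as np

def nw_estimator(x, X, Y, h):
    """Nadaraya-Watson estimate (A.12) at the points x (p = 1 case)."""
    u = (x[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2)                 # Gaussian kernel weights
    W = K / K.sum(axis=1, keepdims=True)    # weights W_i(x) summing to one
    return W @ Y

# The DGP of Figure 7.8: Y_i = X_i^3 + eps_i, n = 20, h_n = 0.3
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, 20)
Y = X**3 + rng.standard_normal(20)
grid = np.linspace(-2.0, 2.0, 81)
fit = nw_estimator(grid, X, Y, h=0.3)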
Figure 7.8: Local averages: (a) based on n = 20 observations from the DGP Y_i = X_i^3 + ε_i with \{ε_i\} \overset{i.i.d.}{\sim} N(0,1) and \{X_i\} \overset{i.i.d.}{\sim} U[-2,2]; (b) based on n = 100 observations from the same DGP as in part (a).
Figure 7.8(a) shows two NW kernel smoothed averages based on the series \{(X_i, Y_i)\}_{i=1}^{20} generated from the model Y_i = X_i^3 + ε_i with \{ε_i\} \overset{i.i.d.}{\sim} N(0,1) and \{X_i\} \overset{i.i.d.}{\sim} U[-2,2]. The true regression function y = x^3 is shown by the black solid line. Using a Gaussian kernel with h_n = 0.3, the local averages are shown as a blue medium dashed line, and the local average corresponding to h_n = 0.1 by the red dotted line.
The kernel discriminates each Yi according to the distance of its corresponding Xi
from x and has its greatest value at the origin. Generally, it is positive and symmetric,
and decreases from the origin. In this way, the kernel has the effect of reducing bias
without increasing variance. The bandwidth hn controls the ‘width’ of the kernel and
is used to ‘tune’ the degree of smoothing: the greater hn , the greater the smoothing.
Clearly, the blue medium dashed line is less ‘wiggly’, and hugs closer to the true
regression curve than the red dotted line. Overall, the NW estimator with hn = 0.3 is
to be preferred because, intrinsically, its variance and squared bias are better balanced.
As n increases, variance will decrease as more averaging is performed. Then h_n should be decreased to reduce the amount of local smoothing – thus reducing bias – but not so much as to effect a comparable increase in the variance, i.e. h_n → 0 as n → ∞.
As n becomes large, we may expect the estimate to converge to the true curve at
every point x. Figure 7.8(b) illustrates convergence effects and shows local averages
computed for n = 100.
Optimum convergence of the kernel estimate can be achieved by selecting the bandwidth h_n using CV. It uses the aptly named leave-one-out estimator \hat{\mu}^{-i}_{h_n}(·) of μ(·). At X_i = x,
\[
\hat{\mu}^{-i}_{h_n}(x) = \sum_{j \ne i} W^{-i}_j(x)\, Y_j,
\]
with weights W^{-i}_j(·) as defined in (A.12); the superscript −i indicates the absence of Y_i in the averaging, and the subscript h_n the explicit dependence on the bandwidth. The CV function is then
defined as the sample-average MSE that results from adopting the leave-one-out estimator, i.e.,
\[
\mathrm{CV}(h_n) = \frac{1}{n}\sum_{i=1}^{n}\{Y_i - \hat{\mu}^{-i}_{h_n}(X_i)\}^2. \tag{A.14}
\]
The (global) bandwidth h_{CV} that minimizes (A.14) across a pre-specified range of values h_n is then used to compute the kernel estimate \hat{\mu}_{h_n}(·). Typically, CV(·) has one unique minimum with no other local minima. In the i.i.d. case, the CV routine produces asymptotically optimal kernel estimates. For dependent data, convergence results of the CV bandwidth selection method have been obtained for certain types of mixing processes and univariate regression functions.
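A direct Python sketch of the CV criterion (A.14) for the univariate NW estimator follows; minimization over a pre-specified bandwidth grid is an illustrative choice.

import numpy as np

def cv_score(h, X, Y):
    """Leave-one-out CV criterion (A.14) for the NW estimator with a
    Gaussian kernel; a direct O(n^2) implementation."""
    u = (X[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2)
    np.fill_diagonal(K, 0.0)              # exclude Y_i from its own average
    mu_loo = (K @ Y) / K.sum(axis=1)      # leave-one-out estimates at X_i
    return np.mean((Y - mu_loo) ** 2)

def h_cv(X, Y, grid):
    """Global CV bandwidth: minimizer of CV(h) over a bandwidth grid."""
    scores = [cv_score(h, X, Y) for h in grid]
    return grid[int(np.argmin(scores))]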
Note that the computation of one value of CV(·) requires n2 kernel evaluations, which
may be unacceptable when n is large. A variety of refinements of the CV bandwidth selec-
tion method are available to address this problem. For instance, minimizing a generalized
CV function, or minimizing the final prediction error. Another way for obtaining global
bandwidths is to use a plug-in bandwidth procedure.
with K_{h_n}(v) = h_n^{-p}\prod_{i=1}^{p} K(v_i/h_n). The above minimization problem can be rephrased in matrix notation to allow for direct computation using weighted least squares. For instance, with d = 1, the so-called local linear (LL) estimator is given by
\[
\hat{\mu}^{LL}_{h_n}(x) = e'(X_x' W_x X_x)^{-1} X_x' W_x y, \tag{A.16}
\]
where e is a (d+1) × 1 vector having 1 in the first entry and zeros elsewhere, y = (Y_1, \ldots, Y_n)' is the vector of responses,
\[
X_x = \begin{pmatrix} 1 & (x - X_1)' \\ \vdots & \vdots \\ 1 & (x - X_n)' \end{pmatrix},
\]
and W_x = \mathrm{diag}\{K_{h_n}(x - X_1), \ldots, K_{h_n}(x - X_n)\} is an n × n matrix of weights.
In general, the local polynomial estimator is more attractive than the NW estimator because of its better asymptotic bias performance. Moreover, the estimator does not suffer from boundary effects, and hence does not require modifications in regions near the end points of the support set. Another useful feature is that the method immediately estimates the rth derivative μ^{(r)}(·) (r = 1, \ldots, d) via the relationship \hat{\mu}^{(r)}_{h_n}(·) = r!\,\hat{\beta}_r(·).
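The LL estimator (A.16) amounts to a small weighted least squares problem at each evaluation point, as the following univariate Python sketch shows; the Gaussian kernel and function name are illustrative assumptions.

import numpy as np

def ll_estimator(x0, X, Y, h):
    """Local linear estimate (A.16) at a single point x0 (p = 1, d = 1)."""
    w = np.exp(-0.5 * ((x0 - X) / h) ** 2)           # diagonal of W_x
    Xx = np.column_stack([np.ones_like(X), x0 - X])  # design matrix X_x
    WX = Xx * w[:, None]
    beta = np.linalg.solve(Xx.T @ WX, WX.T @ Y)      # (X_x' W_x X_x)^{-1} X_x' W_x y
    return beta[0]   # beta[1] carries first-derivative information (r! beta_r)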
Definition B.1 (Copula) Let C : [0, 1]m → [0, 1] be an m-dimensional distribution function
on [0, 1]m . Then C is a copula if it has uniformly distributed univariate marginal CDFs on
the interval [0, 1].
Another interpretation of a copula function follows from the probability integral transform (PIT), U_i ≡ F_i(X_i). If the marginal distribution functions F_1, \ldots, F_m of F are continuous, the random variable U_i will have the U(0,1) distribution regardless of the original distribution F_i, i.e.
\[
C(u_1, \ldots, u_m) = \mathbb{P}(U_1 \le u_1, \ldots, U_m \le u_m).
\]
Thus, the copula C of X represents the joint CDF of the vector of PITs U = (U_1, \ldots, U_m)', and hence is a joint CDF with U(0,1) marginals.
The next theorem is cardinal to the theory of copulas.
The behavior of the copulas with respect to strictly monotonic transformations is estab-
lished in the next theorem; see Embrechts et al. (2003, Thm. 2.6). It forms the basis for the
role of copulas in the study of (multivariate) measures of association (dependence).
According to Nelsen (2006, Thm. 2.2.7), the partial derivatives ∂ C(u)/∂ui of C exist for
almost all ui (i = 1, . . . , m). Then we may define a copula density as follows.
\[
f(x) = c\bigl(F_1(x_1), \ldots, F_m(x_m)\bigr)\prod_{i=1}^{m} f_i(x_i), \tag{B.3}
\]
where f_i(x_i) is the density associated with the marginal CDF F_i(x_i). This representation is particularly useful for copula ML parameter estimation because it provides an explicit expression for the likelihood function in terms of the copula density and the product of the marginal densities.
Every m-dimensional copula C (m ≥ 2) is bounded in the following sense:
\[
W(u) \le C(u) \le M(u), \quad \forall u \in [0,1]^m,
\]
Figure 7.9: Contour plots of three bivariate copula densities: (a) Gaussian copula with
ρ = 0.5, (b) Student tν copula with ρ = 0.9 and ν = 15 degrees of freedom, and (c) Student
tν copula with ρ = 0.9 and ν = 1 degree of freedom.
where M(·) and W(·) are the Fréchet–Hoeffding bounds. The upper bound M(·) is also known as the comonotonic copula. It represents the copula of X if each of the random variables X_1, \ldots, X_m can (a.s.) be represented through a strictly monotone functional relationship between X_i and X_j (i ≠ j). This copula is also said to describe perfect positive dependence. The lower bound W(·) is a copula only for dimension m = 2.
A wide range of copulas exists. The most commonly used copulas are the Gumbel copula for extreme distributions, the Gaussian copula for linear correlation, and the Archimedean copula and the Student t copula for dependence in the tail. A multivariate Gaussian distribution Φ(·) with m × m correlation matrix R yields the Gaussian copula
\[
C^{Ga}(u) = \Phi_R\bigl(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_m)\bigr), \quad u \in [0,1]^m.
\]
Similarly, the Student t_ν copula is given by
\[
C^{t}(u) = t_{\nu,R}\bigl(t_\nu^{-1}(u_1), \ldots, t_\nu^{-1}(u_m)\bigr)
= \int_{-\infty}^{t_\nu^{-1}(u_1)} \cdots \int_{-\infty}^{t_\nu^{-1}(u_m)} \frac{\Gamma\bigl(\frac{\nu+m}{2}\bigr)|R|^{-1/2}}{\Gamma\bigl(\frac{\nu}{2}\bigr)(\nu\pi)^{m/2}} \Bigl(1 + \frac{y' R^{-1} y}{\nu}\Bigr)^{-\frac{\nu+m}{2}} dy,
\]
where t_\nu^{-1}(·) denotes the quantile function of a standard univariate Student t_ν distribution. The multivariate Gaussian copula may be thought of as a limiting case of the multivariate t copula as ν → ∞, ∀u ∈ [0,1]^m.
Based on three MC simulation samples of T = 10,000 observations, Figure 7.9 shows contour plots of (a) a bivariate Gaussian copula density with correlation coefficient ρ = 0.5, (b) a bivariate Student t copula density with ρ = 0.9 and ν = 15, and (c) a bivariate Student t copula density with ρ = 0.9 and ν = 1. We see that the copulas have symmetric tail dependencies. The lower- and upper-tail dependencies are better captured by the t copula with ν = 1 than by the one with ν = 15 degrees of freedom.
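Contour plots such as those in Figure 7.9 can be reproduced by simulating from the copulas directly. The sketch below draws from a bivariate Gaussian copula via the PIT U_i = Φ(X_i); it assumes NumPy and SciPy are available, and the function name is an illustrative choice.

import numpy as np
from scipy.stats import norm

def gaussian_copula_sample(n, rho, seed=None):
    """Draw n pairs (U_1, U_2) from a bivariate Gaussian copula with
    correlation rho, using the probability integral transform."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return norm.cdf(X)          # componentwise Phi maps margins to U(0,1)

U = gaussian_copula_sample(10_000, rho=0.5)   # input for a contour estimate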
Definitions
Let X1 , X2 , . . . be i.i.d. random variables with distribution function F taking values in an
m-dimensional Euclidean space Rm . Consider a measurable kernel function h : Rr → R
(r ∈ N), that is symmetric in its arguments. Suppose we wish to derive a minimum-
variance unbiased estimator of an estimable parameter (alternatively, statistical functional),
say θ = θ(F ). That is,
θ(F ) ≡ E[h(X1 , . . . , Xr )] = h(x1 , . . . , xr )dF (x1 ) · · · dF (xr ).
Rr
Then, given a (possibly multivariate) sequence {Xi }ni=1 (n ≥ r), the U-statistic of order r
(the letter U stands for unbiased) is given by
−1
n
Un = h(Xi1 , . . . , Xir ).
r
1≤i1 <i2 <···<ir ≤n
The basic theory of U-statistics is due to Hoeffding (1948) as a generalization of the notion of forming an average. One well-known example is the sample variance with h(x_1, x_2) = (x_1 - x_2)^2/2. Another example is Kendall's τ statistic (1.13) with h\bigl((x_1, y_1), (x_2, y_2)\bigr) = 2I(x_1 < x_2, y_1 < y_2) + 2I(x_2 < x_1, y_2 < y_1) - 1. Also, it is easy to see that the correlation integral (7.10) is a U-statistic with h(x, y) = I(\lVert x - y\rVert < h).
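The defining average over all size-r subsets can be coded directly; the sample-variance kernel provides a quick check. This brute-force enumeration is only feasible for small n.

import numpy as np
from itertools import combinations

def u_statistic(x, h, r):
    """U-statistic of order r with symmetric kernel h (direct enumeration)."""
    vals = [h(*c) for c in combinations(x, r)]
    return float(np.mean(vals))

# Sample variance as a U-statistic with kernel h(x1, x2) = (x1 - x2)^2 / 2
rng = np.random.default_rng(0)
y = rng.standard_normal(200)
s2 = u_statistic(y, lambda a, b: 0.5 * (a - b) ** 2, r=2)
assert np.isclose(s2, y.var(ddof=1))   # agrees with the unbiased variance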
Closely related to the U-statistic is the V-statistic for estimating θ(F), defined by
\[
V_n = n^{-r} \sum_{i_1, \ldots, i_r = 1}^{n} h(X_{i_1}, \ldots, X_{i_r}).
\]
Observe that
\[
V_n = \theta(F_n) = \int_{\mathbb{R}^r} h(x_1, \ldots, x_r)\, dF_n(x_1) \cdots dF_n(x_r),
\]
where F_n(x) = n^{-1}\sum_{i=1}^{n} I(X_i \le x). This is an example of a differentiable statistical functional, a class of statistics introduced by von Mises (1947) (hence the letter V). Clearly, V_n is a biased statistic for r > 1, because the sum in the defining equation contains some terms in which i_1, \ldots, i_r are not all distinct. However, the bias of V_n is asymptotically negligible (O(n^{-1})). Also, for a fixed sample size n, the variance of V_n satisfies \mathrm{Var}(V_n) = \mathrm{Var}(U_n) + O(n^{-2}). So, in terms of MSE, V_n may be preferred over U_n.
A U-statistic (or V-statistic) of order r and variances σ_1^2 ≤ σ_2^2 ≤ \cdots ≤ σ_r^2 has a degeneracy of order k if σ_1^2 = \cdots = σ_k^2 = 0 and σ_{k+1}^2 > 0 (k < r). Many examples exist of exact or approximate (as n → ∞) degenerate U- or V-statistics. For instance, it is easy to prove that CvM–GOF type test statistics (see, e.g., Section 4.4.1) are degenerate V-statistics, i.e. \int_{-\infty}^{\infty} h(x, y)\, dF(y) = 0\ \forall x, where
\[
h(x, y) = \int_{-\infty}^{\infty} \bigl[I(x \le z) - F(z)\bigr]\bigl[I(y \le z) - F(z)\bigr]\, w\bigl(F(z)\bigr)\, dF(z),
\]
with w(·) a non-negative weight function on (0, 1).
For c = 1, \ldots, r, define
\[
h_c(x_1, \ldots, x_c) = E\bigl[h(x_1, \ldots, x_c, X_{c+1}, \ldots, X_r)\bigr],
\]
where X_{c+1}, \ldots, X_r are i.i.d. random variables from the distribution F. In fact, h_c(·) is (a version of) the conditional (hence the subscript letter c) expectation of h(X_1, \ldots, X_r) given X_1, \ldots, X_c.
Since h_0 = θ and h_r(x_1, \ldots, x_r) = h(x_1, \ldots, x_r), the functions h_c(·) all have expectation θ. Further, note that the variance of the U-statistic U_n depends on the variances of the h_c(·). Without loss of generality we may take σ_0^2 = 0. Moreover, for c = 1, \ldots, r, we define σ_c^2 = \mathrm{Var}\bigl(h_c(X_1, \ldots, X_c)\bigr), so that σ_r^2 = \mathrm{Var}\bigl(h(X_1, \ldots, X_r)\bigr). Using these preliminaries, it can be shown (Hoeffding, 1948) that the variance of U_n is given by
\[
\mathrm{Var}(U_n) = \binom{n}{r}^{-1} \sum_{c=1}^{r} \binom{r}{c}\binom{n-r}{r-c}\, \sigma_c^2.
\]
The projection of U_n is given by
\[
\hat{U}_n = \theta + \frac{r}{n}\sum_{i=1}^{n}\bigl[h_1(X_i) - \theta\bigr].
\]
Yoshihara (1976, Thm. 1) and Denker and Keller (1983, Thm. 1(c)) relax the assumption of i.i.d. random variables X_i to accommodate strictly stationary weakly dependent processes. Specifically, for a non-degenerate symmetric kernel h : \mathbb{R}^r \to \mathbb{R}, and assuming that \{X_i\} is β-mixing, these authors showed that
\[
\sqrt{n}\,(U_n - \theta) \xrightarrow{D} N(0, r^2\sigma_1^2), \quad \text{as } n \to \infty.
\]
This result can easily be applied to the correlation integral (7.10). As before, consider the m-dimensional time series \{Y_t, t \in \mathbb{Z}\} for which each random variable is assumed to be generated from the distribution F_m(·). Likewise, let the kernel be the indicator function, and note then that
\[
h_1(y) = E\bigl[h(y, X_s)\bigr] = \int_{\mathbb{R}^m} I(\lVert y - x\rVert \le h)\, dF_m(x).
\]
Let h_1(y; h) ≡ h_1(y), so that the dependence of h_1(·) on the bandwidth h is made explicit. Then the asymptotic distribution of the estimator \hat{C}_{m,T}(Y; h), defined by (7.43), can be expressed as
\[
\sqrt{T}\,\bigl(\hat{C}_{m,T}(Y; h) - C_{m,Y}(h)\bigr) \xrightarrow{D} N\bigl(0, 4\sigma^2_{m,T}(Y; h)\bigr),
\]
where
\[
\sigma^2_{m,T}(Y; h) = E\bigl[\bigl(h_1(Y_1; h) - C_{m,Y}(h)\bigr)^2\bigr] + 2\sum_{t=2}^{T} E\bigl[\bigl(h_1(Y_1; h) - C_{m,Y}(h)\bigr)\bigl(h_1(Y_t; h) - C_{m,Y}(h)\bigr)\bigr].
\]
where the Z_j are independent N(0, 1) random variables, and the λ_j are the eigenvalues associated with the kernel h_2(x_1, x_2) - θ. This result also applies to the V-statistic, since \sqrt{n}\,(U_n - V_n) \xrightarrow{P} 0, under the additional assumption that \sum_{j=1}^{\infty} \lambda_j < \infty. A more general version of this asymptotic result is given by Beutner and Zähle (2014), using a new representation for U- and V-statistics. In fact, their continuous mapping approach not only encompasses most of the results on the asymptotic distribution known in the literature, but also allows for the first time a unifying treatment of non-degenerate and degenerate U- and V-statistics.
Exercises
Theory Questions
7.1 Let \{Y_t\} be an i.i.d. process with distribution function F(y). An equivalent form of the one-dimensional correlation integral is given by C_{1,Y}(h) = P(|Y_t - Y_s| < h) (t ≠ s).
\[
\hat{C}_{2,Y}(h) = \frac{2}{(T-1)(T-2)} \sum_{i=2}^{T-1} \sum_{j=1}^{i-1} I(|Y_i - Y_j| < h)\, I(|Y_{i+1} - Y_{j+1}| < h).
\]
7.2 Suppose \{Y_t, t \in \mathbb{Z}\} is a strictly stationary process generated by the following two models:
\[
\text{ARCH(1):}\quad Y_t = \sigma_t\varepsilon_t, \quad \sigma_t^2 = 1 + \theta Y_{t-1}^2,
\]
\[
\text{sign AR(1):}\quad Y_t = \theta\,\mathrm{sign}(Y_{t-1}) + \sqrt{1-\theta}\;\varepsilon_t,
\]
where 0 < θ < 1, and \{\varepsilon_t\} \overset{i.i.d.}{\sim} N(0, 1). Given a set of observations \{Y_t\}_{t=1}^{T}, the parameter θ can be estimated semiparametrically by maximizing the pseudo log-likelihood for the copula density c\bigl(\hat{F}(Y_t; θ), \hat{F}(Y_{t-1}; θ); θ\bigr), where \hat{F}(Y_t; θ) is the EDF. For testing the null hypothesis of serial independence the associated semiparametric (denoted by the superscript SP) score-type test statistic, apart from a normalizing factor, is defined as
\[
Q^{SP} = \sum_{t=2}^{T} \frac{\partial \log c(\hat{u}_t, \hat{u}_{t-1}; \theta)}{\partial\theta}\Big|_{\theta=0},
\]
where the \hat{u}_t are the realizations of \hat{U}_t \equiv \hat{F}(Y_t; \theta).
(a) Show for the ARCH(1) model, that the SP score-type test statistic is given by
\[
Q^{SP}_{ARCH} = \sum_{t=2}^{T} \bigl\{\Phi^{-1}(\hat{u}_t)\bigr\}^2 \bigl\{\Phi^{-1}(\hat{u}_{t-1})\bigr\}^2.
\]
(b) Show for the sign AR(1) model, that the SP score-type test statistic is given by
\[
Q^{SP}_{sAR} = \sum_{t=2}^{T} \mathrm{sign}\bigl(\Phi^{-1}(\hat{u}_{t-1})\bigr)\, \Phi^{-1}(\hat{u}_t).
\]
7.3 \hat{\Delta}^{ST_2}_T(\ell) is the weighted functional \Delta^*_S(\ell) = 2\int \{f(x, y) - f(x)f(y)\}\, f(x, y)\,dx\,dy given in Section 7.2.3. Let \{Y_t, t \in \mathbb{Z}\} be a Gaussian zero-mean stationary process. Show that \Delta^*_S(\cdot) satisfies the nonnegativity property \Delta^*_S(\cdot) \ge 0, where the equality holds if and only if Y_t and Y_{t-\ell} are independent.
(Skaug and Tjøstheim, 1993a)
7.4 Let \{e_t\}_{t=1}^{T} be the residuals from a fitted time series model. Consider the least squares regression (7.49). The slope coefficient β_m can be estimated as
\[
\hat{\beta}_m = \frac{\sum_h \bigl(\log h - \overline{\log h}\bigr)\bigl(\log \hat{C}_{m,T}(e; h) - \overline{\log \hat{C}_{m,T}(e; h)}\bigr)}{\sum_h \bigl(\log h - \overline{\log h}\bigr)^2},
\]
where log h is the logarithm of the tolerance distance, \log \hat{C}_{m,T}(e; h) is the logarithm of the sample correlation integral, m is the embedding dimension, and where the bars denote the means of their counterparts without bars. Show that
\[
E[\hat{\beta}_m] \le m.
\]
(This was first proved by Cutler (1991), and later by Kočenda (2001).)
7.5 In Section 2.11 we fitted a RBF–AR(8) model to the EEG recordings (epilepsy data). The data file epilepsyMR.dat contains the residual series \{e_t\}_{t=1}^{623}.
(a) Make a time series plot of the residuals. Also make a plot of the sample ACF of the residuals (30 lags), and a histogram. What conclusions do you draw from these graphs?
(b) The R-copula package contains the copula-based CvM test statistic M^{CvM,c}_{A,T} for testing univariate serial independence, introduced in Section 7.4.4; see Ghoudi et al. (2001) and Genest and Rémillard (2004). In this part, we investigate the null hypothesis of serial independence of the residuals in a more formal way.
• First, simulate the distribution of the CvM test statistic, the distribution
of the combined test statistic à la Fisher, and the distribution of the com-
bined test statistic à la Tippett. Use the function serialIndepTestSim with
lag.max=5, and fix the number of bootstrap replicates at 1,000 (default
value). [Note: The computations can be time demanding.]
• Next, using the function serialIndepTest, compute approximate p-values of
the test statistics with respect to the EDFs obtained in the previous step.
• Finally, display the dependogram.
Use the above results to investigate the type of departure from residual serial independence, if any.
7.6 Tong (1990, p. 178) fits the following SETAR(2; 2, 2) model to the (log_{10}) Canadian lynx data of Section 7.5:
\[
Y_t =
\begin{cases}
0.62 + 1.25Y_{t-1} - 0.43Y_{t-2} + \varepsilon_t^{(1)} & \text{if } Y_{t-2} \le 3.25,\\
2.25 + 1.52Y_{t-1} - 1.24Y_{t-2} + \varepsilon_t^{(2)} & \text{if } Y_{t-2} > 3.25,
\end{cases}
\]
where \{\varepsilon_t^{(1)}\} and \{\varepsilon_t^{(2)}\} are independent sequences of i.i.d. random variables with \{\varepsilon_t^{(1)}\} \overset{i.i.d.}{\sim} N(0, 0.0381) and \{\varepsilon_t^{(2)}\} \overset{i.i.d.}{\sim} N(0, 0.0621).
where \mathcal{F}^t is the σ-algebra generated by \{Y_s, s \le t\}, Φ(·) is the CDF of the N(0, 1) distribution, φ_{i,0}, φ_{i,1}, \ldots, φ_{i,p_i} and σ_i are the AR parameters of the ith component of the mixture, and \{\pi_i\}_{i=1}^{K} is a set of so-called mixing proportions which satisfy π_i > 0 and \sum_{i=1}^{K} \pi_i = 1. A characteristic feature of the MAR model is that both its conditional and unconditional marginal distributions are nonnormal and can be multimodal.
The BIC model selection criterion is given by \mathrm{BIC} = -2\ell_T(y; \hat{\theta}_T) + m \log(T - n), where \ell_T(y; \hat{\theta}_T) is the value of the maximized log-likelihood function of the sample, m is the dimension of the parameter vector θ, and n is the number of initial values. Using this criterion, the best fitted MAR model is
\[
F(Y_t | \mathcal{F}^{t-1}; \hat{\theta}_T) = \underset{(0.0810)}{0.3163}\;\Phi\Bigl(\frac{Y_t - \underset{(0.1798)}{0.7107} - \underset{(0.0621)}{1.1022}Y_{t-1} + \underset{(0.0826)}{0.2835}Y_{t-2}}{\underset{(0.0202)}{0.0887}}\Bigr)
+ \underset{(0.0810)}{0.6837}\;\Phi\Bigl(\frac{Y_t - \underset{(0.1564)}{0.9784} - \underset{(0.0884)}{1.5279}Y_{t-1} + \underset{(0.0869)}{0.8817}Y_{t-2}}{\underset{(0.0202)}{0.0887}}\Bigr),
\]
where asymptotic standard errors of the parameter estimates are given in parentheses, and the value of BIC is −198.82.
(a) Check the adequacy of the fitted MAR model by computing the first 20 sample auto-
correlations of the Pearson residuals defined by (6.72). Repeat this step for the squared
Pearson residuals.
(b) Check the adequacy of the fitted MAR model by computing the first two diagnostic
test statistics in Table 6.3 (AT,K1 and HT,K2 ) using quantile residuals, and with K1 =
K2 = {5, 10, 15, 20, 25, 30}. Compare and contrast the results with those obtained in
part (a).
[Hint: Replace the covariance estimator \hat{\Omega}_T in (6.89) by an estimator \tilde{\Omega}_T using numerical derivatives for both the log-likelihood function and the quantile residuals, given a set of T = 20,000 simulated observations (Kalliovirta, 2012, p. 365).]
(a) Show that the bias and variance of \hat{f}_h(x), defined in (A.2), satisfy
\[
\mathrm{Bias}\bigl(\hat{f}_h(x)\bigr) = E\bigl(\hat{f}_h(x)\bigr) - f(x) = \frac{1}{\nu!} f^{(\nu)}(x)\, h^\nu \mu_\nu(K) + o(h^\nu),
\]
\[
\mathrm{Var}\bigl(\hat{f}_h(x)\bigr) = \frac{1}{nh} f(x) R(K) + o\Bigl(\frac{1}{nh}\Bigr),
\]
where f^{(ν)}(·) denotes the νth derivative of f(·), assuming it exists. Comment on the difference in bias between second- and higher-order kernels.
(b) Combine the results in part (a) to obtain the asymptotic MSE (AMSE) of \hat{f}_h(·). Comment on the bias-variance trade-off.
(c) Derive an expression for the AMISE of \hat{f}_h(·).
(d) Show that by differentiating \mathrm{AMISE}\bigl(\hat{f}_h(x)\bigr) with respect to h, and setting the derivative equal to zero, the optimal bandwidth is given by
\[
h_{opt} = R\bigl(f^{(\nu)}\bigr)^{-1/(2\nu+1)} \Bigl(\frac{(\nu!)^2 R(K)}{2\nu\,\mu_\nu^2(K)}\Bigr)^{1/(2\nu+1)} n^{-1/(2\nu+1)}.
\]
8.1 Preliminaries
A strictly stationary stochastic process \{Y_t, t \in \mathbb{Z}\} is defined to be TR if, for any integer m and for all integers t_1, \ldots, t_n (−∞ < t_1 < \cdots < t_n < ∞), the vectors (Y_{t_1}, Y_{t_2}, \ldots, Y_{t_n}) and (Y_{-t_1+m}, Y_{-t_2+m}, \ldots, Y_{-t_n+m}) have the same joint probability distribution. Letting m = t_1 + t_n, we see that for a strictly stationary process \{Y_t, t \in \mathbb{Z}\} time reversibility implies that
Figure 8.1: (a) Scatter plot at lag 1 of the time series \{X_t = Y_{1,t} + Y_{2,1001-t}\}_{t=1}^{1{,}000}, where \{Y_{i,t}, t \in \mathbb{Z}\} (i = 1, 2) are two independent realizations of the logistic map (1.22) with a = 4; (b) Scatter plot at lag 1 of the time series \{X_t^* = Y_{1,t} + Y_{2,t}\}_{t=1}^{1{,}000}.
\[
\psi_Y(\ell) = \gamma_Y^{(2,1)}(\ell) - \gamma_Y^{(1,2)}(\ell), \quad (\ell \in \mathbb{Z}),
\]
where \hat{\gamma}_Y^{(i,j)}(\ell) = (T-\ell)^{-1}\sum_{t=\ell+1}^{T} Y_t^i Y_{t-\ell}^j with (i, j) = (2, 1) and (1, 2).^1 One can easily show that \hat{\gamma}_Y^{(i,j)}(\ell) is an unbiased and consistent estimator of \gamma_Y^{(i,j)}(\ell). Moreover, if \{Y_t, t \in \mathbb{Z}\} is a zero-mean i.i.d. process with E(Y_t^4) < \infty, it is easy to verify (Exercise 8.2(a)) that an exact expression of the variance of \hat{\psi}_Y(\ell) is given by
Under H_0: \psi_Y(\ell) = 0, it can be shown that \mathrm{TR}(\ell) \xrightarrow{D} N(0, 1) as T → ∞. The prerequisite of the test statistic is that \{Y_t, t \in \mathbb{Z}\} must possess at least a finite sixth-order moment. Note that this condition may often be viewed as too restrictive for DGPs without higher-order moments, which is typically the case with financial data.
Ramsey and Rothman (1996) recommend the following two-stage procedure for testing Type I and II time-irreversibility.
^1 The idea of using the difference \hat{\gamma}_Y^{(2,1)}(\ell) - \hat{\gamma}_Y^{(1,2)}(\ell) as a measure for TR is comparable to using the difference between lag-\ell sample cross-correlations of standardized residuals, e.g. \hat{\rho}_\varepsilon^{(2,1)}(\ell) - \hat{\rho}_\varepsilon^{(1,2)}(\ell) (see Example 6.8), as an alternative (omnibus-type) test statistic for diagnostic checking.
(ii) Fit a causal ARMA(p, q) model to the standardized series \{Y_t\}_{t=1}^{T}, using an order selection criterion to find the optimal values of p and q. Obtain the residuals and compute (8.5), replacing μ_{r,Y} by \hat{\mu}_{r,Y} = T^{-1}\sum_{t=1}^{T} Y_t^r (r = 2, 3, 4).
(iii) Generate a new time series \{Y_t^*\}_{t=1}^{T} using the fitted model in step (ii), and with \{\varepsilon_t\}_{t=1}^{T} generated as a sequence of i.i.d. N(0, 1) random variables. Obtain the corresponding value of \hat{\psi}_{Y^*}(\ell). Repeat this step a large number of times.
(iv) Compute the sample standard deviation of \hat{\psi}_{Y^*}(\ell) via its simulated distribution. Using the result in step (i), compute \mathrm{TR}(\ell) for \ell = 1, 2, \ldots.
Two comments are in order. First, with some fitted linear ARMA models, direct computation of the variance formula (8.5) may result in negative estimates. Step (iii) overcomes this potential problem by simulating the distribution function of \hat{\psi}_Y(\ell). A second, and more serious, problem is that the ARMA prewhitening in step (ii) may destroy TR since it induces a phase shift in the series; see Hinich et al. (2006). As a consequence, the TR test statistic (8.6) could lead to false rejections of the null hypothesis.
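The following Python sketch implements the core of the procedure for a raw (already standardized) series. For simplicity it estimates the standard deviation of \hat{\psi}_Y(\ell) by simulation under an i.i.d. Gaussian null rather than from a fitted ARMA model; it is an illustration of the idea, not the authors' code.

import numpy as np

def psi_hat(y, lag):
    """Sample TR measure: gamma_hat^(2,1)(lag) - gamma_hat^(1,2)(lag)."""
    y = np.asarray(y, dtype=float)
    yt, ylag = y[lag:], y[:-lag]
    return np.mean(yt**2 * ylag) - np.mean(yt * ylag**2)

def tr_statistic(y, lag, n_sim=1000, seed=None):
    """TR(lag): psi_hat standardized by a simulated standard deviation,
    here obtained under an i.i.d. N(0,1) null (cf. steps (iii)-(iv))."""
    rng = np.random.default_rng(seed)
    sims = np.array([psi_hat(rng.standard_normal(len(y)), lag)
                     for _ in range(n_sim)])
    return psi_hat(y, lag) / sims.std(ddof=1)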
This result forms the basis of a TR test statistic proposed by Chen et al. (2000). Let g(·) be a weighting function such that \int_0^\infty g(\omega)\,d\omega < \infty. More specifically, g(·) should be chosen such that \phi_{X,\ell}(·) will not be integrated to zero when the distribution of \{X_t(\ell), t \in \mathbb{Z}\} is asymmetric. A necessary condition is
\[
\int_0^\infty \phi_{X,\ell}(\omega)\, g(\omega)\,d\omega = \int_{-\infty}^{\infty}\Bigl[\int_0^\infty \sin\bigl(\omega X_t(\ell)\bigr)\, g(\omega)\,d\omega\Bigr]\, dF_{X_t,\ell} \ne 0, \quad \forall \ell \in \mathbb{Z}. \tag{8.8}
\]
where \psi_g(x) = \int_0^\infty \sin(\omega x)\, g(\omega)\,d\omega. Given an observable segment \{Y_t\}_{t=1}^{T} of \{Y_t, t \in \mathbb{Z}\}, and by abuse of notation, a natural point estimator of (8.9) is given by
\[
\hat{\psi}_g(\ell) = \frac{1}{T-\ell}\sum_{t=\ell+1}^{T} \psi_g\bigl(X_t(\ell)\bigr).
\]
Because \psi_g(·) is a static transformation, \{X_t(\ell)\} and \{\psi_g(X_t(\ell))\} are also strictly stationary processes for each fixed \ell \in \mathbb{Z}. Then, under a minimal mixing condition (see, e.g., White, 1984, Thm. 5.15), it is easy to show that, as T → ∞,
\[
\sqrt{T-\ell}\,\bigl(\hat{\psi}_g(\ell) - \mu_g(\ell)\bigr) \xrightarrow{D} N\bigl(0, \sigma^2_{\psi_g}(\ell)\bigr), \tag{8.11}
\]
where
\[
\sigma^2_{\psi_g}(\ell) = \lim_{T\to\infty} \mathrm{Var}\Bigl(\frac{1}{\sqrt{T-\ell}}\sum_{t=\ell+1}^{T} \psi_g\bigl(X_t(\ell)\bigr)\Bigr)
= \mathrm{Var}\bigl\{\psi_g\bigl(X_t(\ell)\bigr)\bigr\} + 2\sum_{i=1}^{T-\ell-1}\Bigl(1 - \frac{i}{T-\ell}\Bigr)\mathrm{Cov}\bigl(\psi_g(X_t(\ell)), \psi_g(X_{t+i}(\ell))\bigr).
\]
The resulting TR test statistic is
\[
\bar{C}_g(\ell) = \sqrt{T-\ell}\,\hat{\psi}_g(\ell)\big/\hat{\sigma}_{\psi_g}(\ell), \tag{8.12}
\]
where \hat{\sigma}^2_{\psi_g}(\ell) is a consistent estimator for \sigma^2_{\psi_g}(\ell). Its form is given by
\[
\hat{\sigma}^2_{\psi_g}(\ell) = \hat{\gamma}_{\psi_g}(0) + 2\sum_{j=1}^{T-\ell-1} W_{T,\ell}(j)\, \hat{\gamma}_{\psi_g}(j),
\]
where \hat{\gamma}_{\psi_g}(j) is the lag-j sample autocovariance of \{\psi_g(X_t(\ell));\ \ell+1 \le t \le T\} and
\[
W_{T,\ell}(j) = \Bigl(1 - \frac{j}{T-\ell}\Bigr)\Bigl(1 - \frac{1}{2(T-\ell)^{1/3}}\Bigr)^{j} + \frac{j}{T-\ell}\Bigl(1 - \frac{1}{2(T-\ell)^{1/3}}\Bigr)^{T-\ell-j}, \quad (j \in \mathbb{N}). \tag{8.13}
\]
The weight function (8.13) ensures that \hat{\sigma}^2_{\psi_g}(\ell) is always non-negative. Its form is motivated by the lag window used in the stationary bootstrap method of Politis and Romano (1994) and adopted by Chen et al. (2000) and Chen (2003). These latter authors further suggest taking g(\omega) = (1/\beta)\exp(-\omega/\beta) (\omega > 0), for some \beta \in (0, \infty), so that \psi_g(x) = \beta x/(1 + \beta^2 x^2). By adjusting the parameter β, the resulting test statistic is flexible enough to capture various types of asymmetry. The test statistic (8.12) seems to have high empirical power with β = 1 and β = 2.
Observe that (8.12) essentially is a general test statistic for detecting symmetry of the marginal distribution of the observed time series \{Y_t\}_{t=1}^{T}. It is a TR test statistic when applied to \{X_t(\ell)\}_{t=\ell+1}^{T}. A useful feature of \bar{C}_g(\ell) is that the test statistic can be used without any moment assumptions.^2 Indeed, simulations provided by Chen et al. (2000) confirm that this test statistic is quite robust to the moment properties of the DGP being tested.
Unfortunately, the test statistic (8.12) is a check for unconditional symmetry using the observed time series \{Y_t\}_{t=1}^{T}. From an application perspective, however, conditional symmetry is often of more interest. This implies that we need to replace \{Y_t, t \in \mathbb{Z}\} by some residual series \{\hat{\varepsilon}_t\}. In that case, Chen and Kuan (2002) suggest modifying the computation of \hat{\sigma}^2_{\psi_g}(\ell) by bootstrapping from the standardized residuals of a time series model, using a model-free bootstrap approach. Provided the first four moments of the error process \{\varepsilon_t\} exist, the resulting TR test statistic is still asymptotically normally distributed under the null hypothesis that E[\psi_g(\varepsilon_t)] = 0.
where \{\varepsilon_t\} \overset{i.i.d.}{\sim} N(0, 1). Figure 8.2(a) shows a plot of a typical subset of

\[
S_{TR} \xrightarrow{D} \chi^2_M, \tag{8.16}
\]
where (\omega_1, \omega_2, \omega_3) \in [0, 1]^3 are normalized frequencies, and the third-order cumulant function is defined as \gamma_Y(\ell_1, \ell_2, \ell_3) = E(Y_t Y_{t+\ell_1} Y_{t+\ell_2} Y_{t+\ell_3}). Owing to symmetry relations, the trispectrum needs to be calculated only in a subset of the complete (\omega_1, \omega_2, \omega_3)-space; see, e.g., Dalle Molle and Hinich (1995) for a description of nonredundant regions of (8.17), including its principal domain.
The normalized magnitude of the trispectrum, known as the squared tricoherence, can be expressed as

compute \hat{f}_Y(\omega_j) = T^{-1}\sum_{k=1}^{K} |Y_k(\omega_j)|^2 with T ≈ KN. Next, replace steps (iii) and
\[
\hat{f}_Y(\omega_{j_1}, \omega_{j_2}, \omega_{j_3}) = \frac{1}{T}\sum_{k=1}^{K} Y_k(\omega_{j_1})\, Y_k(\omega_{j_2})\, Y_k(\omega_{j_3})\, Y_k(-\omega_{j_1} - \omega_{j_2} - \omega_{j_3}).
\]
\[
S^*_{TR} \xrightarrow{D} \chi^2_{M^*}. \tag{8.20}
\]
This measure takes values in [0, 1] for any copula, with the lower and upper bounds attainable. Based on (8.21), Beare and Seo (2014) propose a TR test statistic for the null hypothesis H_0: \delta_C = 0. Using the notation in Section 8.1, let \theta \in [0, 1/3] be given by
which, in view of (8.21), implies that \theta = \tfrac{1}{3}\delta_C. Given a set of observations \{Y_t\}_{t=1}^{T}, a natural empirical analogue of \theta is
Under H_0 and fairly weak regularity conditions, it can be shown (Beare and Seo, 2014) that \hat{\theta}_T is asymptotically distributed as
\[
\sqrt{T}\,\hat{\theta}_T \xrightarrow{D} \sup_{(y_1, y_2) \in \mathbb{R}^2} |B(y_1, y_2) - B(y_2, y_1)|, \quad \text{as } T \to \infty, \tag{8.23}
\]
conditional on the observed data \{Y_t\}_{t=1}^{T}, the objective is to generate bootstrap pseudo-replicates Y_1^*, \ldots, Y_T^* from which the statistic of interest, in the present case (8.22), can be calculated.
For a first-order Markov chain the local resampling algorithm generating the bootstrap replicates may be applied in the following way.
(ii) Suppose that for some t ∈ \{1, \ldots, T-1\} the values Y_1^*, \ldots, Y_t^* have already been sampled. Now, for the (t+1)th bootstrap observation set Y_{t+1}^* = Y_{J+1}, where J is a discrete random variable with probability mass function (pmf)
\[
\mathbb{P}(J = j) = K_h(Y_t^* - Y_j)\Big/\sum_{i=1}^{T-1} K_h(Y_t^* - Y_i), \quad (j = 1, \ldots, T-1).
\]
Recursive application of step (ii) yields the pseudo-time series \{Y_t^*\}_{t=1}^{T}. Notice that the above procedure resamples the observed time series in such a way that the probability of Y_j being selected is higher the closer its preceding value Y_{j-1} is to the last generated bootstrap replicate Y_{t-1}^*.
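A compact Python sketch of the local resampling scheme is given below; the uniform draw for Y_1^* anticipates the simple rule discussed next, and the Gaussian kernel is an illustrative choice.

import numpy as np

def local_bootstrap(y, h, seed=None):
    """One local-bootstrap pseudo-replicate of a univariate series,
    following step (ii) with kernel K_h."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    T = len(y)
    out = np.empty(T)
    out[0] = y[rng.integers(T)]          # uniform random starting value
    for t in range(T - 1):
        w = np.exp(-0.5 * ((out[t] - y[:-1]) / h) ** 2)  # K_h(Y*_t - Y_j)
        J = rng.choice(T - 1, p=w / w.sum())             # pmf P(J = j)
        out[t + 1] = y[J + 1]            # set Y*_{t+1} = Y_{J+1}
    return out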
One practical aspect is the choice of the initial bootstrap observation Y_1^*. A simple approach is to draw at random from the entire set of observations \{Y_t\}_{t=1}^{T} with equal probability. Another issue concerns the selection of h. One simple rule-of-thumb approach is to use the ‘optimal’ resampling width, in the sense of minimizing the AMSE of the bootstrap one-step transition distribution function; see Paparoditis and Politis (2002). Assume that \{Y_t\}_{t=1}^{T} is generated by an AR(1) process Y_t = \phi_0 + \phi_1 Y_{t-1} + \varepsilon_t with \{\varepsilon_t\} an i.i.d. sequence of random variables. Then, under the simplifying assumption that \{\varepsilon_t\} \overset{i.i.d.}{\sim} N(0, \sigma_\varepsilon^2), it can be proved that the optimal resampling width is
\[
h(y) = \Bigl(\frac{\sigma_\varepsilon^4 K_1}{T f_Y(y)\{2\sigma_\varepsilon^2 C_1^2(y) + 0.25 C_2^2\}}\Bigr)^{1/5}, \quad (y \in \mathbb{R}), \tag{8.24}
\]
where, with a Gaussian kernel, K_1 = 1/(2\sqrt{\pi}), C_1(y) = \phi_1 \sigma_Y^{-2}(y - \mu_Y) and C_2 = \phi_1^2. A sample version of h(y) can easily be obtained by fitting an AR(1) model to the data, and replacing the unknown quantities in (8.24) by their sample estimates.
be formulated in a state space framework via the joint density function f_m(y) of \{Y_t^{(\ell)}, t \in \mathbb{Z}\}, i.e., the process is invariant under time reversal for all m and \ell if and only if,
pdf defined as the convolution of f_m(y) with a multivariate Gaussian kernel K_h(·), i.e.,
\[
f_m^*(y) = \int_{\mathbb{R}^m} K_h(y - \xi)\, f_m(\xi)\, d\xi, \tag{8.26}
\]
where
\[
K_h(x) = (\sqrt{2\pi}\, h)^{-m} \exp\{-\lVert x\rVert^2 / 2h^2\},
\]
with h > 0 the bandwidth, and \lVert\cdot\rVert the Euclidean norm. The convolution process has the symmetry-preserving property that f_m^*(y) = f_m^*(Py)\ \forall y \in \mathbb{R}^m under the null hypothesis H_0: f_m(y) = f_m(Py). Then a quadratic measure to evaluate the difference between the smoothed densities is defined as
\[
Q_h(m) = \frac{1}{2}(2h\sqrt{\pi})^m \int_{\mathbb{R}^m} \bigl(f_m^*(y) - f_m^*(Py)\bigr)^2\, dy
= (2h\sqrt{\pi})^m \int_{\mathbb{R}^m} \bigl[f_m^*(y)\, f_m^*(y) - f_m^*(y)\, f_m^*(Py)\bigr]\, dy, \tag{8.27}
\]
which is always non-negative and equals zero if and only if f_m^*(y) = f_m^*(Py). Substituting (8.26) in (8.27), using integration by parts and a change of variables, gives the expression
\[
Q_h(m) = \int_{\mathbb{R}^m}\int_{\mathbb{R}^m} f_m(r)\bigl[\exp\{-\lVert r - s\rVert^2/(4h^2)\} - \exp\{-\lVert r - Ps\rVert^2/(4h^2)\}\bigr]\, f_m(s)\, ds\, dr. \tag{8.28}
\]
where
which, approximately, has mean zero and standard deviation one if the m-dimensional processes \{Y_i, i \in \mathbb{Z}\} and \{Y_j, j \in \mathbb{Z}\} are independent.
In applications of the test statistic S_{h,T}(m), an important question is how to select the bandwidth h. In kernel-based estimation it is well known that selecting h too small leads to a higher variance of the kernel estimator, called undersmoothing. On the other hand, choosing a bandwidth that is too large increases the bias (oversmoothing) of the estimator. In practice, both factors are often balanced via CV.
Another issue concerns the dependence among delay vectors. Diks et al. (1995) suppress this effect by dividing the (i, j) plane of indices into squares of size τ × τ, with τ some fixed number larger than the typical time scale, and next replacing w_{ij} by \bar{w}_{i',j'} = \tau^{-2}\sum_{p=1}^{\tau}\sum_{q=1}^{\tau} w_{i'\tau+p,\, j'\tau+q}. This method is supposed to provide more reliable estimates of the standard deviation of S_{h,T}(m). Clearly, the influence of the parameter τ on the performance of this test statistic is comparable to the bandwidth influence. Moreover, since the parameters τ and h are bound together, the selection of their optimal values should be carried out simultaneously, for instance by using CV.
Figure 8.3: Boxplots of R(m) based on 1,000 MC replications of series of length T = 5,000 generated from the time-delayed Hénon map with dynamic noise process (8.35), and with (a) \ell = 1 and (b) \ell = 2.
a strictly stationary and TR stochastic process \{X_t(\ell) = Y_t - Y_{t-\ell}, t \in \mathbb{Z}\}, we have
\[
\mathbb{P}\bigl(X_0(\ell) > 0\bigr) = \mathbb{P}\bigl(X_0(\ell) < 0\bigr) = \frac{1}{2}, \quad (\ell = 1, \ldots, m-1).
\]
The object of interest is thus the probability \pi(\ell) \equiv \mathbb{P}\bigl(X_0(\ell) > 0\bigr), which may be thought of as a simple measure of deviation from zero of the one-dimensional distribution of \{X_t(\ell), t \in \mathbb{Z}\}. A natural point estimator of \pi(\ell) is
\[
\hat{\pi}(\ell) = \frac{1}{T-\ell}\sum_{t=\ell+1}^{T} I\bigl(X_t(\ell) > 0\bigr), \quad (\ell = 1, \ldots, m-1). \tag{8.32}
\]
\[
\sqrt{T-\ell}\,\bigl(\hat{\pi}(\ell) - \pi(\ell)\bigr) \xrightarrow{D} N\bigl(0, \sigma_X^2(\ell)\bigr), \tag{8.33}
\]
where
\[
\sigma_X^2(\ell) = \pi(\ell)\bigl(1 - \pi(\ell)\bigr) + 2\pi(\ell)\sum_{t=1}^{\infty}\bigl\{\mathbb{P}\bigl(X_t(\ell) > 0 \mid X_0(\ell) > 0\bigr) - \pi(\ell)\bigr\}. \tag{8.34}
\]
The circular block bootstrap procedure of Politis and Romano (1992) for stationary processes may be used to obtain an estimate of (8.34). A practical difficulty with this approach is the choice of the block length. Another possibility is to approximate the sampling distribution of (8.32) by subsampling, which requires the selection of a subsample size. Below we present an example of the TR test statistic \hat{\pi}(\ell) applied to data generated by a nonlinear high-dimensional stochastic process.
Table 8.1: p-values of six TR test statistics. Blue-typed numbers indicate rejection of the null hypothesis of TR at the 5% nominal significance level.
\[
R(m) = \frac{1}{m-1}\sum_{\ell=1}^{m-1} |0.5 - \hat{\pi}(\ell)| \times 100, \tag{8.36}
\]
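Both \hat{\pi}(\ell) of (8.32) and the summary measure R(m) of (8.36) are one-liners in Python; the function names are illustrative.

import numpy as np

def pi_hat(y, lag):
    """Estimator (8.32): proportion of positive differences Y_t - Y_{t-lag}."""
    y = np.asarray(y, dtype=float)
    return np.mean(y[lag:] - y[:-lag] > 0)

def R_stat(y, m):
    """Summary measure (8.36): average |0.5 - pi_hat(l)| over l = 1,...,m-1,
    expressed in percent."""
    return 100.0 * np.mean([abs(0.5 - pi_hat(y, l)) for l in range(1, m)])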
Table 8.2: Results of TR test statistic \bar{C}_g(\ell), as defined by (8.12), for lags \ell = 1, \ldots, 10.^{(1)} Blue-typed numbers indicate rejection of the null hypothesis of TR at the 5% nominal significance level.
Time lag
Series 1 2 3 4 5 6 7 8 9 10
Unemployment rate(2) 1.512 2.122 2.183 1.684 1.605 0.809 0.407 0.226 0.622 0.489
EEG recordings -0.257 -0.241 -0.285 -0.224 -0.222 -0.173 -0.104 -0.019 0.081 0.116
Magnetic field data -0.610 -0.479 -0.541 0.040 -0.286 -0.397 -0.334 -0.081 0.757 0.549
ENSO phenomenon 1.258 1.182 1.282 1.195 1.141 1.209 1.321 1.378 1.362 1.287
Climate change: δ 13 C 0.571 -0.299 -0.122 0.370 -0.016 -0.469 -0.384 -0.730 -0.622 -0.342
δ 18 O -0.548 -1.288 -1.660 -1.620 -1.320 -1.156 -1.104 -0.971 -0.574 -0.593
(1) Based on the exponential density function g(ω) = (1/β) exp(−ω/β) (ω > 0) with β set at
the reciprocal of the sample standard deviation of each series.
(2) First differences of original series.
change δ^{13}C time series. For the remaining five series, TR is rejected at the 5% nominal significance level. The p-values of the frequency-domain test statistic S_{TR} (column 4) differ considerably from those of S^*_{TR} (column 5). For all time series TR is strongly rejected on the basis of S^*_{TR}, while with S_{TR}, evidence of time-irreversibility is restricted to three series. Thus, the p-values of S^*_{TR} rule out linear models with Gaussian distributions for all series. Note, however, that these test results can be sensitive to the choice of M; see also the discussion in Section 4.4.4.
Except for the magnetic field data, the copula-based test statistic \hat{\theta}_T (column 6) does not reveal evidence of time-irreversibility at the 5% nominal significance level. This may be due to the first-order Markov chain assumption used in the construction of the test statistic; that is, higher-order Markov chains may well provide a better representation of the DGP underlying the time series, and consequently may change the outcome of the test statistic.
The p-values of S_{h,T}(m) differ considerably across the values of m. For m = 2 and 3, none of the p-values rejects TR at the 5% nominal significance level. For m = 4 and 5, we see that there is evidence of time-irreversibility in the EEG recordings. Thus, it seems worthwhile not to rely completely on low-dimensional test results.
Table 8.2 presents test results of \bar{C}_g(\ell) for \ell = 1, \ldots, 10. Only in one case does the test statistic reject the TR null hypothesis, i.e. the U.S. unemployment series at lags \ell = 2 and 3. In all other cases, the null hypothesis is not rejected at the 5% nominal significance level. Characterization of the U.S. unemployment series as time-irreversible through the various TR test statistics suggests asymmetric behavior consistent with the steepness asymmetry business cycle hypothesis, elaborated upon in the introductory paragraph of this chapter. Also, time-irreversibility of the EEG recordings, as we observed in Table 8.1, is an indicator of nonlinear dynamics.
Section 8.4: Brendan Beare and Juwon Seo have made available MATLAB code for com-
puting the copula-based TR test statistic for Markov chains. The C++ source code and
a Linux/Windows executable of the kernel-based TR test statistic Sh,T (m) (Section 8.4.2)
can be downloaded from Cees Diks’ web page, located at http://cendef.uva.nl/people.
Exercises
Theory Questions
8.1 Let \{Y_t, t \in \mathbb{Z}\} be a strictly stationary i.i.d. process with mean zero, \mu_{3,Y} = E(Y_t^3) \ne 0, and finite moments \mu_{2,Y} = E(Y_t^2) and \mu_{4,Y} = E(Y_t^4). Verify (8.5).
8.2 Suppose that \{f(t), t \in \mathbb{Z}\} is a strictly stationary time series process with mean zero, defined on the interval [T_1, T_2]. The bicovariance function of f(t) can be approximated by
\[
\gamma^{(i,j)}(\ell) = \frac{1}{(T_2 - \ell) - T_1}\int_{T_1}^{T_2-\ell} f^i(t)\, f^j(t+\ell)\, dt, \quad (i \ne j;\ \ell \in \mathbb{Z}).
\]
Show that the bicovariance function \gamma^{(i,j)}_{TR}(\ell) of the time-reversed stochastic function is not necessarily equal to \gamma^{(i,j)}(\ell), except when f(t) obeys time reversal, i.e. f_{TR}(t) = f(-t) = f(t + \xi), where ξ is an adjustable parameter that fixes the origin of the time axis.
8.3 Consider the strictly stationary, zero-mean, stochastic process \{X_t(\ell) \equiv Y_t - Y_{t-\ell}, t \in \mathbb{Z}, \ell \in \mathbb{N}\}. Let \rho_Y^{(2,1)}(\ell) = E(Y_t^2 Y_{t-\ell})/E(Y_t^2)^{3/2}, and \rho_Y^{(1,1)}(\ell) = E(Y_t Y_{t-\ell})/E(Y_t^2).
(b) Assume that the functions \rho_Y^{(2,1)}(\ell) and \rho_Y^{(1,1)}(\ell) are differentiable on [0, \infty). Show the above expression is approximately given by
where \rho'_{21}(0) and \rho'_{11}(0) denote the first non-zero derivatives of \rho_Y^{(2,1)}(\ell) and \rho_Y^{(1,1)}(\ell) at the origin, respectively.
(c) Using part (b), argue that as \ell \downarrow 0 time-irreversibility is most apparent for small values of \ell.
(Cox, 1991)
8.4 The Gamma distribution is often used to model a wide variety of positive valued
time series variables. Applications include fields such as hydrology (river flows), met-
eorology (rainfall, wind velocities), and finance (intraday durations between trades).
Within this context, Lewis et al. (1989) introduce the simple first-order Beta-Gamma autoregressive (BGAR(1)) process
\[
Y_t = B_t Y_{t-1} + G_t, \quad (t \in \mathbb{Z}),
\]
where \{B_t\} and \{G_t\} are mutually independent sequences of i.i.d. random variables with Beta(kρ, k(1−ρ)) and Gamma(k(1−ρ), β) distributions, respectively, with shape parameter k > 0, rate parameter β > 0, and ρ (0 ≤ ρ < 1) describing the dependency structure of the process. It is easily established, using moments of Beta variables, that \rho(\ell) = \rho^{|\ell|} (\ell \in \mathbb{Z}).
(a) Let Y and B be independent Gamma(k, β) and Beta(kρ, k(1−ρ)) random variables, respectively. Then it can be shown that BY and (1 − B)Y are independent Gamma(kρ, β) and Gamma(k(1−ρ), β) variables. Using this result, prove that the Laplace–Stieltjes transform of the random variable (v + Bu)Y (v ≥ 0, u ≥ 0) is given by
\[
E\bigl[e^{-(v+Bu)Y}\bigr] = \Bigl(\frac{\beta}{\beta+v}\Bigr)^{k(1-\rho)}\Bigl(\frac{\beta}{\beta+v+u}\Bigr)^{k\rho}.
\]
(c) Given the result in part (b), state your conclusion about the TR of the BGAR(1)
process.
where \{\varepsilon_t\} is a sequence of i.i.d. random variables with a continuous marginal distribution. The process \{Y_t, t \in \mathbb{Z}\} may be viewed as a stochastic version of the so-called Anosov diffeomorphism on a two-dimensional torus, i.e.
\[
\begin{pmatrix} y_{i+1} \\ x_{i+1} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} y_i \\ x_i \end{pmatrix} \pmod{1},
\]
(c) The joint distribution of each of the pairs (Y_{t-\ell}, Y_t) (\ell \ge 1) is symmetric with respect to the matrix operator P, defined as P(y_1, y_2)' = (y_2, y_1)'.
(a) Investigate the six time series in Table 8.1 for the presence of TR using S(ℓ), i.e. test the null hypothesis H_0^{(0)}: f(y) = f(−y) ∀y. To reduce the computational burden, set the number of BS replicates at 99.

(b) Repeat part (a), but now test the null hypothesis H_0^{(1)}: f(Y_t, Y_{t−1}) = f(Y_{t−1}, Y_t). Are there any marked differences between the test results in parts (a) and (b)?
4 Also known as the Bhattacharyya–Matusita–Hellinger measure of dependence; see Bhattacharyya (1943) and Matusita (1955).
Chapter 9
SEMI- AND NONPARAMETRIC
FORECASTING
The time series methods we have discussed so far can be loosely classified as para-
metric (see, e.g., Chapter 5), and semi- and nonparametric (see, e.g., Chapter 7). For
the parametric methods, usually a quite flexible but well-structured family of finite-
dimensional models is considered (Chapter 2), and the modeling process typically
consists of three iterative steps: identification, estimation, and diagnostic checking.
Often these steps are complemented with an additional task: out-of-sample fore-
casting. Within this setting, the specification of the functional form of a parametric
time series model generally derives from theory or from previous analysis of the
underlying DGP; in both cases a great deal of knowledge must be incorporated in
the modeling process. Semi- and nonparametric methods, on the other hand, are
infinite-dimensional. These methods assume very little a priori information and
instead base statistical inference mainly on data. Moreover, they require “weak”
(qualitative) assumptions, such as smoothness of the functional form, rather than
quantitative assumptions on the global form of the model.
For all these reasons, a practitioner is often steered into the realm of semi- and
nonparametric function estimation or “smoothing”. However, the price to be paid is
that parametric estimates typically converge at a root-n rate, while nonparametric
estimates usually converge at a slower rate. Also, semi- and nonparametric methods
acknowledge that fitted models are inherently misspecified, which implies specifica-
tion bias. Increasing the complexity of a fitted model typically decreases the absolute
value of this bias, but increases the estimation variance: a feature known as the bias-
variance trade-off. The bandwidth or tuning parameter controls this trade-off, and
its choice is often critical in implementation and practice.
In this chapter, we deal with various aspects of semi- and nonparametric mod-
els/methods with a strong focus on forecasting. The desire for forecasting future
time series values, along with frequent misuse of methods based on linear or Gaus-
sian assumptions, motivates this area of interest. Based on results in Appendix 7.A,
the first half of this chapter is concerned with kernel-based methods for estimat-
ing the conditional mean, median, mode, variance, and the complete conditional
density of a time series process. We examine and compare the use of single-stage
versus multi-stage quantile prediction. Further, we describe kernel-based methods
for jointly estimating the conditional mean and the conditional variance. This part
also includes methods for estimating multi-step density forecasts using bootstrap-
ping, and methods for nonparametric lag selection.
The second half of the chapter deals with semiparametric models/methods. It
is well known that conventional nonparametric estimators can suffer poor accuracy
for data of dimension two and higher. In fact, the number of observations needed
to attain a fixed level of estimation accuracy grows exponentially with the number
of dimensions. This problem is called the curse of dimensionality and presents
a dilemma for the effective and practical use of nonparametric forecast methods.
One way to circumvent this “curse” is to use additive models. These models make
the assumption that the underlying regression function may have a simpler, additive
structure, comprising several lower-dimensional functions. As such, they fall in the
class of semiparametric models/methods, combining parametric and nonparametric
features. In Section 9.2, we discuss several additive (semiparametric) models for
time series prediction with emphasis on conditional mean and conditional quantile
forecasts. Then, in Sections 9.2.5 and 9.2.6, we introduce two restricted, and closely
related, forms of a semiparametric AR model.
the covariate X_t, and that μ(·) is defined through the conditional distribution. Given a loss function L(·) with a unique minimum, define μ(·) such that it minimizes the conditional mean E{L(Z_t − a) | X_t = x} with respect to a. Then estimating μ(·) nonparametrically by μ̂(·) and calculating μ̂(X_{T−p+1}) gives Ẑ_{T−p+1}. In this way, we obtain the H-step ahead forecast value Ŷ_{T+H|T} as an estimator of Y_{T+H|T} = E(Y_{T+H}|X_T).
Using the above principle, we define three predictors, i.e. the conditional mean,
the conditional median, and the conditional mode, each depending on a particular
form of the function L(·). These predictors will be expressed as a sum of products
between functions of {Yt } and weights Wt (x), depending on the values of Xt , i.e.
the weights are defined as
W_t(x) = \frac{K\{(x − X_t)/h_n\}}{\sum_{t=1}^{n} K\{(x − X_t)/h_n\}},   (n = T − H − p + 1).   (9.3)
Hence, given {Y_t, t ≤ T}, the H-step ahead nonparametric estimator of the conditional mean is defined as

Ŷ^{Mean}_{T+H|T} = \sum_{t=1}^{n} Z_t W_t(X_{T−p+1}).   (9.5)

Under certain mixing conditions on the process {(X_t, Z_t), t ∈ ℤ}, Collomb (1984) shows uniform convergence of Ŷ^{Mean}_{T+H|T}.
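To fix ideas, the following minimal R sketch implements the weights (9.3) and the conditional mean predictor (9.5) for the special case p = 1, so that X_t = Y_t and Z_t = Y_{t+H}. The Gaussian kernel and the fixed bandwidth h are illustrative assumptions, not the book's code.

```r
# Minimal sketch (p = 1): NW weights (9.3) and the H-step ahead
# conditional mean predictor (9.5); Gaussian kernel, bandwidth h given.
nw_mean_predictor <- function(y, H, h) {
  T <- length(y)
  n <- T - H                  # n = T - H - p + 1 with p = 1
  X <- y[1:n]                 # X_t = Y_t
  Z <- y[(1 + H):T]           # Z_t = Y_{t+H}
  x <- y[T]                   # conditioning value
  K <- dnorm((x - X) / h)     # kernel evaluations
  W <- K / sum(K)             # weights W_t(x) of (9.3)
  sum(Z * W)                  # predictor (9.5)
}

# Example: two-step ahead forecast for a simulated AR(1) series
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 200))
nw_mean_predictor(y, H = 2, h = 0.5)
```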
Conditional median
When the conditional distribution of Zt given Xt is heavy-tailed or asymmetric, it
may be sensible to use the conditional median rather than the conditional mean to
generate future values, as the median is highly resistant against outliers. In this
case the loss function is given by L(u) = |u|, and the solution of (9.2) leads to the conditional median function ξ(x) = inf{z : F(z|x) ≥ 1/2}. Here, F(·|·) is the CDF of Z_t given X_t = x. Estimating ξ(·) nonparametrically gives

ξ̂(x) = inf\{ z : \sum_{t=1}^{n} W_t(x) I(Z_t ≤ z) ≥ 1/2 \}.   (9.6)
Hence, given {Y_t, t ≤ T}, the H-step ahead nonparametric estimator of the conditional median, denoted by Ŷ^{Mdn}_{T+H|T}, is defined as

Ŷ^{Mdn}_{T+H|T} = inf\{ z : \sum_{t=1}^{n} W_t(X_{T−p+1}) I(Z_t ≤ z) ≥ 1/2 \}.   (9.7)
Conditional mode
Collomb et al. (1987) propose a method to produce nonparametric predictions based
on the conditional mode function. In this case, we have a non-convex loss function
with a unique minimum L(u) = 0 when u = 0, and L(u) = 1 otherwise. The
solution of (9.2) leads to the conditional mode function τ (x) = arg maxz∈R f (z|x),
where f (·|x) denotes the conditional density function of Zt given Xt = x. Estimating
τ (·) nonparametrically gives
τ̂(x) = arg max_{z∈ℝ} \sum_{t=1}^{n} K\{(z − Z_t)/h\} W_t(x).   (9.8)

Hence, given {Y_t, t ≤ T}, the H-step ahead nonparametric estimator of the conditional mode is defined as

Ŷ^{Mode}_{T+H|T} = arg max_{z∈ℝ} \sum_{t=1}^{n} K\{(z − Z_t)/h\} W_t(X_{T−p+1}).   (9.9)

Under some mixing conditions on {(X_t, Z_t), t ∈ ℤ}, Collomb et al. (1987) show the uniform convergence of Ŷ^{Mode}_{T+H|T}.
The predictors defined above are direct estimators since they use direct smooth-
ing techniques. Clearly, these predictors are point estimates of a particular loss
function L(·) at some x. However, they do not estimate the whole loss function. In
fact, the H-step ahead conditional mean, median, and mode all ignore information
contained in the intermediate variables Xt+1 , . . . , Xt+(H−1) . In Section 9.1.2, we
introduce a nonparametric kernel smoother which uses such information.
C, such that |f (x1 ) − f (x2 )| ≤ C|x1 − x2 | ∀x1 , x2 ∈ D. The Lipschitz requirement is necessary for
proving uniform convergence results.
the DGP is Markovian, and imposing proper (regularity) conditions, the leave-one-
out CV method can be extended to time series processes.
Table 9.1 gives leave-one-out estimators of the conditional mean, median, and mode with corresponding CV measures. The optimal bandwidth follows from ĥ_opt = arg min_h CV^{(·)}(h), where the superscript (·) denotes one of the three predictors. Then, given ĥ_opt, the H-step ahead nonparametric predictor follows directly. When a time series is strongly correlated, it is reasonable to leave out more than just one observation. For nonparametric density estimation of i.i.d. observations, the plug-in bandwidth h_d = σ̂_Y T^{−1/(p+4)} can be used, with σ̂_Y the standard deviation of {Y_t}_{t=1}^{T}. This choice is a simplified version of expression (A.10) in Chapter 7, with ν = 2. It guarantees an optimal rate of convergence with respect to the MISE. However, h_d is not optimal in all cases since it does not take into account the mixing condition of the stochastic process. Nevertheless, it may serve as an initial pilot bandwidth for CV methods.
f_1(p) = \sum_{t=T−k}^{T} \{Y_t − Ŷ^{(·)}_{t+1|t}(p, h)\}²,   f_2(p) = \sum_{t=T−k}^{T} |Y_t − Ŷ^{(·)}_{t+1|t}(p, h)|,

and

where Ŷ^{(·)}_{t+1|t}(p, h) denotes the one-step ahead kernel-based predictor (i.e. conditional mean, median, or mode), depending on the Markov coefficient p and the bandwidth h.
The value of p is chosen as follows. For a fixed h, obtain p̂_j = arg min_p f_j(p) for each j, and subsequently p̂ = max_j p̂_j (j = 1, 2, 3). For series with T ≥ 100 observations, it is recommended to take k = [T/5], and k = [T/4] otherwise. This procedure is simple and quick. Nevertheless, there is still a need for its theoretical underpinning. Section 9.1.6 discusses alternative methods of lag selection.
Table 9.1: Leave-one-out estimators of the conditional mean, the conditional median, and
the conditional mode with corresponding CV measures.
Equivalently, ξ_q(x) can also be viewed as any solution to the problem of minimizing E{ρ_q(Z_t − a) | X_t = x} with respect to a, where ρ_q(u) = |u| + (2q − 1)u is the so-called check function. Note that ξ_{1/2}(x) ≡ ξ(x), i.e. the conditional median.
Now, given the observations {(X_t, Z_t)}_{t=1}^{n}, an estimator ξ̂_q(x) of ξ_q(x) can be defined as the root of the equation F̂(z|x) = q, where F̂(·|x) is an estimator of F(·|x). Thus, a predictor of the qth conditional quantile of Y_{T+H} is given by ξ̂_q(X_{T−H−p+1}). Of course, in practice a nonparametric estimate of the conditional distribution function is needed. One possible estimator is the NW smoother, which in a time series setting is given by

F̂(z|x) = \frac{\sum_{t=1}^{n} K\{(x − X_t)/h\} I(Z_t ≤ z)}{\sum_{t=1}^{n} K\{(x − X_t)/h\}},   (n = T − H − p + 1).   (9.12)
We shall refer to the root of the equation

F̂(z|x) = q   (9.13)

as the single-stage conditional quantile predictor, and denote it by ξ̂_q^{NW}(x). Alternatively, we may use the local linear (LL) conditional quantile estimator; see Section 9.1.3 for its definition.
Note that the conditional quantile predictor in (9.13) uses only the information in the pairs {(X_t, Z_t)}_{t=1}^{n} and ignores the information contained in

W_t^{(1)} = X_{t+1},   W_t^{(2)} = X_{t+2},   ...,   W_t^{(H−1)} = X_{t+(H−1)}.   (9.14)
Below we illustrate the impact of the data contained in (9.14) on multi-step ahead
prediction accuracy.
F(z|x) = E\{𝒢_1(W_t^{(H−1)}) | X_t = x\}
       = E\{𝒢_2(W_t^{(H−2)}) | X_t = x\}
       ⋮                                           (9.17)
       = E\{𝒢_{H−1}(W_t^{(1)}) | X_t = x\}.
Observe that as we go down line by line in (9.17) more and more information is
utilized. Recalling the two previous inequalities, (9.15) and (9.16), we can see that as
more information is used, the prediction variance gets smaller and hence prediction
accuracy in terms of MSFE improves. Thus, at least in theory, it pays off to use all
the ignored information.
Based on the above recursive setup, we now introduce a kernel-based estimator of F(z|x). First the estimators of 𝒢_1(w) and 𝒢_j(w) (j = 2, ..., H − 2) are defined, respectively, as follows.

Stage 1: Ĝ_1(w) = \frac{\sum_{t=1}^{n} K\{(w − W_t^{(H−1)})/h_1\} I(Z_t ≤ z)}{\sum_{t=1}^{n} K\{(w − W_t^{(H−1)})/h_1\}},

Stage j: Ĝ_j(w) = \frac{\sum_{s=1}^{n} K\{(w − W_s^{(H−j)})/h_j\} Ĝ_{j−1}(W_s^{(H−(j−1))})}{\sum_{s=1}^{n} K\{(w − W_s^{(H−j)})/h_j\}}.

We shall refer to the root of the equation F̃(z|x) = q as the multi-stage qth conditional quantile predictor ξ̃_q^{NW}(x).
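The single- and multi-stage constructions can be sketched in a few lines of R for H = 2 and p = 1. The code below is a naive O(n²) illustration under assumed Gaussian kernels and user-supplied bandwidths h and h1; it is not the authors' implementation.

```r
# Sketch (H = 2, p = 1): single-stage vs. two-stage NW conditional
# quantile predictors; Gaussian kernel, bandwidths h and h1 given.
cond_quantile_2stage <- function(y, x, q, h, h1) {
  T <- length(y)
  X <- y[1:(T - 2)]; W1 <- y[2:(T - 1)]; Z <- y[3:T]  # X_t, W_t^{(1)}, Z_t
  zg <- sort(unique(Z))                               # candidate roots
  Kx <- dnorm((x - X) / h)

  # Single-stage: F^(z|x) of (9.12), inverted at q
  Fhat <- sapply(zg, function(z) sum(Kx * (Z <= z)) / sum(Kx))
  xi_single <- zg[which(Fhat >= q)[1]]

  # Stage 1: kernel regression of I(Z <= z) on W^{(1)}
  G1 <- function(w, z) {
    Kw <- dnorm((w - W1) / h1)
    sum(Kw * (Z <= z)) / sum(Kw)
  }
  # Final stage: smooth G_1(W_t^{(1)}) on X_t, then invert at q
  Ftil <- sapply(zg, function(z) {
    g1 <- vapply(W1, G1, numeric(1), z = z)
    sum(Kx * g1) / sum(Kx)
  })
  xi_multi <- zg[which(Ftil >= q)[1]]
  c(single = xi_single, multi = xi_multi)
}
```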
To compare the AMSE of ξ̃_q^{NW}(x) (multi-stage) with the AMSE of ξ̂_q^{NW}(x) (single-stage), we assume for simplicity of notation that H = 2 and p = 1. From {Y_t, t ∈ ℤ}, let us construct the associated process U_t = (X_t, W_t, Z_t) defined by

X_t = Y_t,   W_t = W_t^{(1)} = Y_{t+1},   Z_t = Y_{t+2}.

We suppose that the random variables {(X_t, W_t)}, respectively {(W_t, Z_t)}, have joint densities f_{X,W}(·,·), respectively f_{W,Z}(·,·). Let g(x), g(z), and g(w) be the marginal densities of {X_t}, {Z_t}, and {W_t}, and f(·|x) = f_{X,Z}(x,·)/g(x) be the conditional density function. Furthermore, we assume that some regularity conditions on the process {U_t, t ∈ ℤ} are satisfied, and that nh → ∞ and nh_1 → ∞ as n → ∞, with h_1 = o(h²).
AMSE\{ξ̂_q^{NW}(x)\} ≈ \frac{5 n^{−4/5}}{4 f²(ξ_q(x)|x)} D_2^{4/5}(ξ_q(x), x) D_1^{1/5}(ξ_q(x), x),   (9.19)

AMSE\{ξ̃_q^{NW}(x)\} ≈ \frac{5 n^{−4/5}}{4 f²(ξ_q(x)|x)} D_3^{4/5}(ξ_q(x), x) D_1^{1/5}(ξ_q(x), x),   (9.20)

where

D_1(y, x) = μ_2²(K) \Big\{ F^{(2,0)}(y|x) + \frac{2 F^{(1,0)}(y|x) g^{(1)}(x)}{g(x)} \Big\}²,

D_2(y, x) = \frac{R(K) σ²(y, x)}{g(x)},   D_3(y, x) = R(K) \frac{v_1(y, x)}{g(x)},

with

r(ξ_q(x), x) = \Big\{ 1 + \frac{v_2(ξ_q(x), x)}{v_1(ξ_q(x), x)} \Big\}^{4/5},   (9.21)
Figure 9.1: Ratio of asymptotic best possible AMSEs (r) versus the quantile level q. From
De Gooijer et al. (2001).
It is easy to verify that σ²(ξ_q(x), x) = q(1 − q). Further, note that σ²(ξ_q(x), x) = v_1(ξ_q(x), x) + v_2(ξ_q(x), x), with v_2 ≤ q(1 − q). Thus, we may re-express (9.21) as follows: r(ξ_q(x), x) = \{q(1 − q)/[q(1 − q) − v_2(ξ_q(x), x)]\}^{4/5}. Figure 9.1 shows a plot of r versus q (0.1 ≤ q ≤ 0.9) for v_2 = 0.05 and 0.08. Clearly, r increases sharply as we move toward the edges of the conditional distribution. This illustrates theoretically that the improvement achieved by ξ̃_q^{NW}(x) is more pronounced for quantiles in the tails of F(·|x).
From asymptotic theory it follows that the optimal bandwidth for both predictors
depends on q. Thus, the amount of smoothing required to estimate different parts of
F (·|x) may differ from what is optimal to estimate the whole conditional distribution
function. This is particularly the case for the tails of F (·|x). We can, however, turn
to the following rule-of-thumb calculations based on assuming a normal (conditional)
distribution as an appropriate approach:
(a) Select a primary bandwidth, say hmean , suitable for conditional mean estima-
tion. For instance, one may use hrot as given by (A.7) in Appendix 7.A with
a Gaussian second-order kernel. Alternatively, various ready-made bandwidth
selection methods for kernel-type estimators of μ(·) are available in the liter-
ature.
(b) Given h_mean, rescale it for the qth conditional quantile as h_q = h_mean [φ\{Φ^{−1}(q)\}²/\{q(1 − q)\}]^{1/(p+4)}, where φ(·) and Φ(·) are the standard normal density and distribution functions, respectively, and p refers to the order of the Markovian process. In particular, when q = 1/2, h_{1/2} = h_mean (2/π)^{1/(p+4)}, using φ\{Φ^{−1}(1/2)\}² = (2π)^{−1}.
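In R, step (b) amounts to a one-line rescaling of h_mean. The function below follows the rule as reconstructed above (which matches the stated q = 1/2 special case), so treat the exact form as an assumption.

```r
# Rule-of-thumb bandwidth for the qth conditional quantile, obtained by
# rescaling a conditional-mean bandwidth h_mean; p is the Markov order.
h_quantile <- function(h_mean, q, p) {
  h_mean * (dnorm(qnorm(q))^2 / (q * (1 - q)))^(1 / (p + 4))
}
h_quantile(h_mean = 0.5, q = 0.5, p = 1)  # equals 0.5 * (2/pi)^{1/5}
```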
Figure 9.2: (a) – (c) Percentile plots of the empirical distribution of the squared errors for model (9.23) for the single-stage predictor ξ̂_q^{NW}(·) (blue solid line) and the multi-stage (here two-stage) predictor ξ̃_q^{NW}(·) (black solid line); (d) – (f) Boxplots corresponding to the percentile plots (a) – (c), respectively; T = 150, and 150 MC replications. From De Gooijer et al. (2001).
where {ε_t} ~ i.i.d. N(0, 1), with the standard normal distribution truncated to the interval [−12, 12]. The objective is to estimate two- and five-step ahead q-conditional quantiles using both ξ̂_q^{NW}(x) and ξ̃_q^{NW}(x) (q = 0.25 and 0.75; x = 6 and 10), and compare their prediction accuracy.
Clearly, a proper evaluation of the accuracy of both predictors requires know-
ledge about the “true” conditional quantile ξq (x). This information is obtained
by generating 10,000 independent realizations of (Yt+H |Yt = x) (H = 2 and 5)
iterating the DGP (9.23) and computing the appropriate quantiles from the
empirical conditional distribution function of the generated observations.
From (9.23), we generate 150 samples of size T = 150. Based on these estimates, we compute for each replication j (j = 1, ..., 150) the following error measures:

where ξ̂_q^{(j)}(x) and ξ̃_q^{(j)}(x) denote the jth estimates of ξ̂_q^{NW}(x) and ξ̃_q^{NW}(x), respectively.
Figure 9.3: Old Faithful geyser data set: (a) Duration of eruption plotted against waiting
time to eruption, and (b) conditional density estimates of eruption duration conditional on
the waiting time to eruption. Time period: August 1, 1985 – August 15, 1985 (T = 299).
From De Gooijer and Zerom (2003).
In the sequel, we first discuss two existing kernel-based smoothers of the condi-
tional density: the NW estimator and the LL estimator. Next, following De Gooijer
and Zerom (2003), we introduce a simple kernel smoother which combines the bet-
ter sides of both estimators. For simplicity, we shall consider the case p = 1, i.e.
{Xt , t ∈ Z} is a univariate process.
where

W_t^{NW}(x) = \frac{K_{h_2}(x − X_t)}{\sum_{t=1}^{n} K_{h_2}(x − X_t)}.
Now, suppose that the second derivative of f(y|x) exists. Also, introduce the short-hand notation f^{(i,j)}(y|x) = ∂^{i+j} f(y|x)/∂x^i ∂y^j. In a small neighborhood of a point x, f(y|x) is approximated locally by a linear function, and the LL estimator of f(y|x), here denoted by f̂_LL(y|x), is defined as the resulting intercept estimate β̂_0. Simple algebra (Fan and Gijbels, 1996) shows that f̂_LL(y|x) can be expressed as
n
fLL (y|x) = Kh1 (y − Yt )WtLL (x), (n = T − p), (9.25)
t=1
where
Kh2 (x − Xt ){Tn,2 − (x − Xt )Tn,1 }
WtLL (x) = ,
(Tn,0 Tn,2 − Tn,1
2 )
with Tn,j = nt=1 Kh2 (x − Xt )(x − Xt )j (j = 0, 1, 2).
From the definition of the two estimators, we can see that f̂_NW(y|x) approximates f(y|x) locally by a constant, while f̂_LL(y|x) approximates f(y|x) locally by a linear model. To appreciate why the extension of the local constant fitting to the local linear alternative is interesting, we now compare the two estimators via their respective moments. To keep the presentation simple, we assume, without loss of generality, that h_1 = h_2 = h. When the process {(X_t, Y_t), t ∈ ℤ} is α-mixing, it can be shown (Chen et al., 2001) that the approximate asymptotic bias and variance of f̂_NW(y|x) are given by
Bias\{f̂_NW(y|x)\} = \frac{1}{2} μ_2(K) h² \Big\{ f^{(2,0)}(y|x) + f^{(0,2)}(y|x) + 2 \frac{g^{(1)}(x)}{g(x)} f^{(1,0)}(y|x) \Big\}   (9.26)

and

Var\{f̂_NW(y|x)\} = \frac{1}{nh²} R²(K) \frac{f(y|x)}{g(x)},   (9.27)

where μ_2(K) = ∫_ℝ u² K(u) du and R(K) = ∫_ℝ K²(z) dz are defined earlier in Appendix 7.A. Similarly, it can be shown (Fan and Gijbels, 1996, Thm. 6.2) that the
Note that the two variances are identical, so the differences in the AMSEs between the two estimators depend only on their respective biases. We see that the bias of f̂_NW(y|x) has an extra term [g^{(1)}(x)/g(x)] f^{(1,0)}(y|x). The bias of f̂_NW(y|x) is large if either |g^{(1)}(x)/g(x)| or |f^{(1,0)}(y|x)| is large, but neither term appears in (9.28). For example, when the marginal density function of X (the design density) is highly clustered, the term |g^{(1)}(x)/g(x)| becomes large. Of course, when g(x) is uniform, the biases of the two estimators are the same. Thus, the fact that f̂_LL(y|x) does not depend on the density of X makes it design adaptive (see, e.g., Fan, 1992).
Now, let's consider |f^{(1,0)}(y|x)|. For simplicity, suppose that the conditional density of Y depends on x only through a location parameter, say the conditional mean
\sum_{t=1}^{n} τ_t(x)(x − X_t) K_h(x − X_t) = 0.   (9.30)

f̂_RNW(y|x) = \sum_{t=1}^{n} K_h(y − Y_t) W_t^{RNW}(x),   (9.31)

where

W_t^{RNW}(x) = \frac{τ_t(x) K_h(x − X_t)}{\sum_{t=1}^{n} τ_t(x) K_h(x − X_t)}.
Setting ∂L_n(·,·)/∂τ_t(x) = 0, we obtain τ_t(x) = 1/\{κ + nλ(x − X_t)K_h(x − X_t)\}. In addition, summing ∂L_n(·,·)/∂τ_t(x) over t and employing (9.30), we can see that κ = n. Hence,

τ_t(x) = n^{−1} \{1 + λ(x − X_t) K_h(x − X_t)\}^{−1}.   (9.32)
So, a zero of G(·) is a stationary point of Ln (·). The implication is that, in prac-
tice, one can compute λ as the unique minimizer of Ln (·). De Gooijer and Zerom
(2003) suggest that a line search algorithm is a suitable choice to compute λ. The
conditional density function displayed in Figure 9.3(b) is computed via the RNW
smoother.
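The computation of λ is indeed a one-dimensional line search. In the R sketch below, L_n is taken to be the negative log empirical-likelihood criterion, an assumption on our part (its stationary point reproduces (9.30) with the weights (9.32)), and the Gaussian kernel is illustrative.

```r
# Sketch of the RNW weights: tau_t(x) from (9.32), with lambda computed by
# a line search over L_n, taken here (an assumption) as the negative
# log empirical-likelihood criterion; its stationary point solves (9.30).
rnw_weights <- function(X, x, h) {
  Kx <- dnorm((x - X) / h) / h                     # K_h(x - X_t)
  u  <- (x - X) * Kx                               # (x - X_t) K_h(x - X_t)
  Ln <- function(lam) -sum(log(pmax(1 + lam * u, 1e-10)))
  lam <- optimize(Ln, interval = c(-1, 1) / (max(abs(u)) + 1e-12))$minimum
  tau <- 1 / (length(X) * (1 + lam * u))           # (9.32), with kappa = n
  w <- tau * Kx
  w / sum(w)                                       # weights W_t^{RNW}(x)
}
```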
It is straightforward to show (De Gooijer and Zerom, 2003) that λ = O_p(h). Moreover, the bias and variance of f̂_RNW(·) are identical to the bias and variance of the LL smoother, given by (9.28) and (9.29), respectively. Thus, the RNW smoother shares the better bias behavior of the LL smoother. If one chooses the optimal bandwidth, say h*, such that it minimizes the AMSE of f̂_RNW(·), it is easy to see that

h* = B n^{−1/6},
y directions. Recall that in defining the RNW smoother we used one bandwidth
h = h1 = h2 . However, in practice there may indeed arise a need to have different
levels of smoothing for each direction. For example, in the Old Faithful geyser
illustration, it is not advisable to have the same h for both variables because they
have different levels of variability. In fact, that was the reason for standardizing
the variables before using a single bandwidth for both. If the approach of pre-
standardizing the data is found inadequate, the RNW smoother can be easily re-
defined to involve two bandwidths.
K-nearest neighbors
In an i.i.d. setting the method of k-nearest neighbors (k-NN) is a simple, yet powerful
and versatile, nonparametric pattern recognition procedure. Within a time series
context the intuition underlying the k-NN approach is that the DGP causes patterns
of behavior to be repeated in {Yt }nt=1 with n = T − p. If a previous pattern can be
identified as most similar to the current behavior of Yt , then the previous subsequent
behavior of the series can be used to predict behavior in the immediate future.
Here, the objective is to produce a nonparametric estimator of the conditional mean μ(x) = E(Y_{t+1}|X_t = x) using the k_n < n vectors closest to X_n in the training, or fitting, set ℱ_t = {X_t | t = 1, ..., n}. To this end, we define a neighborhood around x ∈ ℝ^m as N(x) = {i : i = 1, ..., k_n}, where X_{(i)} denotes the ith-nearest neighbor of x in the sense of a given semi-metric, say D(x, X_{(i)}).
Let K(·) denote a kernel function on ℝ^m. Then the k-NN estimator of μ(x) is defined as

μ̂_{k-NN}(x) = \sum_{i∈N(x)} Y_{(i)+1} W_{(i)}(x),   (9.33)

where

W_{(i)}(x) = \frac{K\{H_{k_n}^{−1} D(x, X_{(i)})\}}{\sum_{i=1}^{n} K\{H_{k_n}^{−1} D(x, X_{(i)})\}},   if \sum_{i=1}^{n} K\{H_{k_n}^{−1} D(x, X_{(i)})\} ≠ 0,
and where H_{k_n} is the bandwidth, defined as the distance to the furthest neighbor, i.e. H_{k_n} ≡ D(x, X_{(k_n)}). Two-step ahead forecasts can be obtained along the same lines as above, using the data set {Y_1, ..., Y_n, μ̂_{k-NN}(x)}.
Clearly, a weighting scheme is necessary to combine the forecasts implied by each neighbor. When K(u) = I(‖u‖_p ≤ 1), the kernel weights are just the uniform weights, i.e. W_{(i)}(x) = 1/k_n ∀i. Using these weights, and some weak mixing conditions, Yakowitz (1987) shows that AMSE\{μ̂_{k-NN}(x)\} = O(n^{−4/(p+4)}). He also establishes asymptotic normality of μ̂_{k-NN}(x).
Note that the k-NN method can be thought of as a kernel regression in which
the size of the local neighborhood around x is allowed to vary, thus providing a
large window around x when the data are sparse. The k-NN kernel estimate is
also automatically able to take into account the local structure of the data. This
advantage, however, may turn into a disadvantage. If there is an outlier in the data,
the local prediction may be bad; see, however, below for a robustification of the
k-NN method. Typically, kn is chosen on the order of magnitude n1/2 , but can be
selected using a procedure such as (G)CV. Traditionally, the Euclidean semi-metric
is chosen as a distance measure.
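A compact R sketch of the uniform-weight case of (9.33) follows, under the illustrative assumptions p = 1 (scalar state with successor Y_{t+1}) and k_n = [n^{1/2}]; the simulated AR(1) data are invented for the example.

```r
# k-NN predictor (9.33) with uniform weights W_(i)(x) = 1/k_n, for p = 1
# and the Euclidean semi-metric.
knn_predict <- function(y, k) {
  n <- length(y) - 1
  X <- y[1:n]                  # states with known successors
  x <- y[n + 1]                # current state
  d <- abs(x - X)              # Euclidean distance (m = 1)
  nb <- order(d)[1:k]          # the k nearest neighbours
  mean(y[nb + 1])              # average of their successors
}
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 300))
knn_predict(y, k = floor(sqrt(300)))  # k_n of order n^{1/2}
```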
Loess/Lowess
The acronyms “loess” and “lowess” both refer to a nonparametric method to calcu-
late an estimate of μ(x) = E(Yt+1 |Xt = x) using locally weighted regression (LWR)
to smooth data. LWR was first introduced by Cleveland (1979) and further de-
veloped by Cleveland and Devlin (1988). The basic underlying model supposes that
Y_t = μ(X_t) + ε_t,   (9.34)

where μ(·) is a smooth function mapping ℝ^p → ℝ, and {ε_t} ~ i.i.d. (0, 1). LWR is a
numerical approach in which μ̂(x*), the estimate of the unknown function μ(·) at a specific value x*, is obtained using a local Taylor series approximation of order d. Let f be a "smoothing" parameter such that 0 < f ≤ 1, and let q_f = [f × n]. Then the LWR uses the "window" of q_f observations nearest to x*, where proximity is defined by the distance D(·,·), commonly taken as the Euclidean norm. In summary, the basic steps to calculate an estimate of μ(x) = E(Y_{t+1}|X_t = x) are as follows.
In summary, the basic steps to calculate an estimate of μ(x) = E(Yt+1 |Xt = x) are
as follows.
(ii) For each {Xt }nt=1 compute the ordered values of the distances D(x, X(i) )
with X(i) the ith-nearest neighbor of x as in (9.33).
(iv) Perform a LWR over the span of values. For lowess, set the order of the polynomial at d = 1, i.e. the regressions are based on LL fits. For loess, set d = 2 (local quadratic fits). The estimate of μ(·) is simply the estimate of the parameter β_0 from the corresponding LS regression.
Note that the parameter f indicates the fraction of the data used in the LWR procedure, analogous to the bandwidth in kernel smoothing. As f increases, more smoothing is done. Since the LWR estimate of μ(·) is linear in Y_t, the asymptotic properties (e.g. consistency) of the estimator can be derived (Stone, 1977) using standard techniques, provided that as n → ∞, q_f → ∞, but q_f/n → 0. If the
data set contains outliers, it is generally recommended to use a robust variant of
Algorithm 9.1. Basically, the robust LWR procedure involves the following steps.
Algorithm 9.2: Robust Loess/Lowess
(i) Compute the residuals {ε̂_t}_{t=1}^{n} from a k-NN pilot estimate of μ(·), and s = Mdn{|ε̂_t|}.

(ii) Calculate the robustness weights δ_t, which are defined as δ_t = K{ε̂_t/(6s)}, where K(·) denotes the biweight second-order kernel function given in Table 7.7 of Appendix 7.A.
(iv) Given the smoothed values from step (ii), compute the next set of residuals
and a new set of robustness weights.
(v) Repeat the previous two steps a few times (by default three times in the R and
S-Plus implementations of loess/lowess). This produces the final estimate of
μ(·).
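Both variants are available in base R: stats::lowess performs the robustness iterations of Algorithm 9.2, and iter = 0 switches them off. The toy series and the injected outliers below are invented for illustration.

```r
# Non-robust vs. robust lowess fits (cf. Figure 9.4).
set.seed(1)
t <- 1:300
y <- sin(t / 25) + rnorm(300, sd = 0.2)
y[c(60, 150)] <- y[c(60, 150)] + 4           # inject two outliers
fit0 <- lowess(t, y, f = 0.1, iter = 0)      # plain LWR (d = 1)
fit3 <- lowess(t, y, f = 0.1, iter = 3)      # three robustness iterations
plot(t, y, col = "grey")
lines(fit0, col = "red"); lines(fit3, col = "blue")
```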
Figure 9.4: (a) Lowess curve fitted to the hourly river flow data set; (b) Robust lowess
curve fitted to the hourly river flow data set; m = 1 and f = 0.1.
the soil is infiltrated to its full capacity due to prior rainfall or melting of snow
and, as a consequence, river flow will be significantly higher than if the soil has
dried out through lack of external sources. There is no discernible long term
nonlinearity caused by evapotranspiration. Here, we ignore the information
that the major effect on the river flow behavior comes from the amount of
rainfall with a few hours delay.
Plot (a) suggests that the lowess method gives a very good identification of
the base flow effects, but extreme peaks, or “outliers”, are less well explained.
Plot (b) shows that the robust lowess method reflects the outlier influences
slightly better (R2 = 0.999) than the non-robust lowess method (R2 = 0.996)
with smoothed values quite close to the observed data (red dots).
unknown functions on R. The first objective is to estimate μ(·) and σ(·) jointly from
T available observations using methods analogous to those for estimating conditional
means. In the second part, we focus on the complete conditional density.
μ̂_NW(x) = \frac{\sum_{t=p+1}^{T} K_H(x − X_t) Y_t}{\sum_{t=p+1}^{T} K_H(x − X_t)},   σ̂²(x) = \frac{\sum_{t=p+1}^{T} K_H(x − X_t) Y_t²}{\sum_{t=p+1}^{T} K_H(x − X_t)} − \{μ̂_NW(x)\}².   (9.36)
Masry and Tjøstheim (1995) establish strong consistency and asymptotic normality
of these estimators for α-mixing processes.
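The pair of estimators in (9.36) is only a few lines of R; the sketch below assumes p = 1, a Gaussian kernel, and a single user-chosen bandwidth h.

```r
# NW estimators (9.36) of the conditional mean and variance for p = 1.
nw_mean_var <- function(y, x, h) {
  T <- length(y)
  X <- y[1:(T - 1)]; Y <- y[2:T]
  K <- dnorm((x - X) / h)
  m <- sum(K * Y) / sum(K)             # mu-hat_NW(x)
  v <- sum(K * Y^2) / sum(K) - m^2     # sigma2-hat(x)
  c(mean = m, var = max(v, 0))         # guard against tiny negative values
}
```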
In an analogous fashion, we can adopt LL estimators and other nonparametric
regression methods to estimate μ(·) and σ(·) jointly. However, there is no a priori
reason to assume that the only features of the conditional distribution that depend
on Xt are the mean and the variance. Hence, it seems reasonable to obtain a
complete conditional density estimate of Yt given Xt = x. The basic setup is as
in Section 9.1.3. Then, assuming a single bandwidth h, a kernel estimate of the
conditional (one-step ahead) density f (·|x) associated with (9.35) is given by
f̂_NW(y|x) = \frac{(T h^{p+1})^{−1} \sum_{t=p+1}^{T} K_{p+1}[\{(y, x) − (Y_t, X_t)\}/h]}{(T h^{p})^{−1} \sum_{t=p+1}^{T} K_p\{(x − X_t)/h\}},   (9.37)
where Kp+1 (·) denotes a p+1 dimensional kernel function, commonly of the product
form. Robinson (1983) establishes a CLT for this estimator. For H ≥ 2, the fore-
cast transition density can be obtained by applying an iterative scheme; see, e.g.,
Algorithm 9.3.
Singh and Ullah (1985) extend the above results to the estimation of the condi-
tional density of a (jointly) strictly stationary real-valued bivariate process {(Xt , Zt ),
t ∈ Z} with Zt = (Zt , . . . , Zt−q ) (q ≥ 0). Moreover, they establish a CLT under far
weaker mixing conditions than those used in Robinson (1983).
W_t(x) = K_{h_1}(x − X_t) \Big/ \sum_{t=p+1}^{T} K_{h_1}(x − X_t),   (9.38)
where h1 > 0 is a bandwidth and Kh1 (·) = K1 (·/h1 )/h1 with K1 (·) a sym-
metric kernel function (e.g., the Gaussian product kernel).
1.2 Using (9.38), resample with replacement from the successors of X_t, i.e., Y*_{T+1} = Y_{J+1}, where J is a discrete random variable taking its value in the set S_{p,T}.

1.3 Repeat steps 1.1 – 1.2 B times, to obtain the bootstrap replicates {Y*_{T+1}^{(b)}}_{b=1}^{B}.
H ≥ 2 (Multi-step ahead):

2.1 Move n one period forward, i.e., n = T + 1, and update X_n accordingly, i.e., X*_n = (Y*_n, Y_{n−1}, ..., Y_{n−p+1})′. Compute new weights using an updated version of (9.38). Resample with replacement from the successors of X*_n, i.e., Y*_{T+2} = Y_{J+1}.

2.2 Keep moving n forward one step. Repeat step 2.1 until n = T + H − 1, updating X*_n each time.

2.3 Repeat steps 2.1 – 2.2 B times, to obtain {Y*_{T+H}^{(b)}}_{b=1}^{B}.

Using another bandwidth h_2 > 0 (i.e., h_2 ∝ B^{−1/5}) and kernel K_2(·), compute the H-step ahead MFD kernel estimator, say f̂^{MFD}_{T+H}(·|X_T), from the B bootstrap replicates in steps 1.3 and 2.3.
By Algorithm 9.3 the values of the probability weights depend on how "close" the vectors X_t are to the conditioning vector X_n. That is, the closer X_t is to X_n, the larger the weight it receives compared to state vectors that are further away. In so doing, the method actually defines for each time point t ∈ S_{p,T} a local neighborhood from which the value Y*_{T+H} is obtained; hence the name local bootstrap. Under certain mixing conditions on the associated process {(X_t, Z_t)} ∈ ℝ^p × ℝ, where Z_t = Y_{t+H}, and some technical assumptions, Manzan and Zerom (2008) demonstrate the asymptotic validity of the MFD when H ≥ 2.
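The one-step case of the local bootstrap fits in a few lines of R. The sketch assumes p = 1 and a Gaussian kernel; a kernel density estimate of the returned replicates then yields the one-step MFD.

```r
# Local bootstrap sketch (p = 1, H = 1): resample successors of X_t with
# the probability weights (9.38); bandwidth h1 given, B replicates.
local_boot_1step <- function(y, h1, B = 500) {
  T <- length(y)
  X <- y[1:(T - 1)]                         # state values with successors
  W <- dnorm((y[T] - X) / h1)
  W <- W / sum(W)                           # probability weights (9.38)
  J <- sample.int(T - 1, size = B, replace = TRUE, prob = W)
  y[J + 1]                                  # bootstrap replicates Y*_{T+1}
}
```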
To accurately capture the dependence structure of the data, the following ap-
proach for the selection of h1 is recommended:
(i) Compute a pilot density estimate f̂_{h_rot}(X_t) = (T − p)^{−1} \sum_{s∈S_{p,T}} K_{h_rot}(X_t − X_s), using h_rot = σ̂_Y N^{−1/5}, where σ̂_Y is the standard deviation of {Y_t}_{t=1}^{N}.

(ii) Compute the local bandwidth factor λ_t = \{f̂_{h_rot}(X_t)/g\}^{−γ}, where g is the geometric mean of f̂_{h_rot}(X_t), i.e., log g = (1/T) \sum_{t=1}^{T} log f̂_{h_rot}(X_t), and γ (0 ≤ γ ≤ 1) is a sensitivity parameter that regulates the amount of weight attributed to the observations in the low-density regions. In terms of lowest MSE, a good choice is γ = 1/2; see Silverman (1986).

(iii) Compute the adaptive (A) bandwidth h_{t,A} = λ_t h_rot. The idea here is to adjust the pilot density estimate in such a way that areas of high (low) density use a smaller (larger) bandwidth.
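Steps (i)–(iii) translate directly into R; the sketch below assumes p = 1 and a Gaussian pilot kernel, and uses the sample size of the state vector as a stand-in for T − p.

```r
# Adaptive bandwidths h_{t,A} = lambda_t * h_rot, steps (i)-(iii), p = 1.
adaptive_bandwidths <- function(X, gamma = 0.5) {
  n     <- length(X)
  h_rot <- sd(X) * n^(-1/5)                              # pilot bandwidth
  fhat  <- sapply(X, function(x) mean(dnorm((x - X) / h_rot)) / h_rot)
  g     <- exp(mean(log(fhat)))                          # geometric mean
  (fhat / g)^(-gamma) * h_rot                            # h_{t,A}
}
```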
we assume that all lags are needed for specifying μ(·), but not necessarily for σ(·). Moreover, we let {ε_t, t ≥ i_p + 1} ~ i.i.d. (0, 1) with finite fourth moment.

FPE(μ̂) = E[\{Y_t − μ̂(X_t)\}² W(X_{M,t})],   (9.39)

where X_{M,t} = (Y_{t−1}, ..., Y_{t−M})′ (M ≥ i_p) is the full lag vector process, and W: ℝ^M → ℝ is a suitably chosen weight function (usually a 0 – 1 function with compact support). Similar to the AIC and its variants, the idea is to choose the lag combination which leads to the smallest FPE(·).
Tjøstheim and Auestad (1994a) derive a stepwise FPE criterion with a penalty
term that is a complicated function of the chosen bandwidth and the selected ker-
nel. For a DGP with correct lag vector (i1 , . . . , ip ) and bandwidth h, as T → ∞,
Tschernig and Yang (2000) obtain an expression for the asymptotic FPEs (AFPEs).
Then, under some mild assumptions, and for both NW and LL estimators of μ(·),
they propose the estimated FPEs

\widehat{FPE}(h, i_1, ..., i_p) = \widehat{AFPE}(h, i_1, ..., i_p) + o\big(h⁴ + (T − i_p)^{−1} h^{−p}\big),   (9.40)
in which the \widehat{AFPE}s are given by

\widehat{AFPE}(h_{opt}, i_1, ..., i_p) = Â_{h_{opt}} + 2\{K(0)\}^p \frac{B̂_h}{(T − i_p) h_{opt}^p},   (9.41)

where

Â_{h_{opt}} = \frac{1}{T − i_p} \sum_{t=i_p+1}^{T} \{Y_t − μ̂(X_t)\}² W(X_{M,t}),   B̂_h = \frac{1}{T − i_p} \sum_{t=i_p+1}^{T} \{Y_t − μ̂(X_t)\}² \frac{W(X_{M,t})}{f̂_{h_{opt}}(X_t)},   (9.42)

and where Â_{h_{opt}} and f̂_{h_{opt}} (a kernel-based estimator of the density function of the lag vector X_t) are evaluated at the optimal bandwidth h_{opt}, while B̂_h uses any bandwidth of order (T − i_p)^{−1/(p+4)}. For a second-order Gaussian kernel, h_{opt} is given by the rule-of-thumb (rot) bandwidth h_{rot} = σ̂_Y \{4/(p + 2)\}^{1/(p+4)} T^{−1/(p+4)}, and K(0) = (2π)^{−1/2}.
Tschernig and Yang (2000) show that conducting lag selection on the basis of (9.40) is consistent if the underlying DGP is nonlinear. Nevertheless, they find that the \widehat{AFPE} criterion tends to select too many lags in general, and suggest a correction to reduce the chance of overfitting. The resulting estimate of the corrected AFPE (CAFPE) is given by

\widehat{CAFPE} = \widehat{AFPE}\{1 + p(T − i_p)^{−4/(p+4)}\}.   (9.43)
Fukuchi (1999) introduces a consistent CV-type method for checking the ad-
equacy of a chosen lag vector, albeit in a linear parametric model setting. The set
of candidate models can be correctly or incorrectly specified, nested or nonnested.
The method also provides a valid approach for selecting the correct lag vector in
(9.35). It uses a measure of forecast risk for each set of one-step ahead forecasts, with
the forecast risk estimated from a growing subsample of the original series {Yt }Tt=1 .
Specifically, in the first step the data set is split into a sample for estimation that contains the first R values (R ≤ T − 1). The remaining T − R observations are used to forecast Y_{R+1}, say Ŷ_{R+1}. Next, the one-step ahead forecast Ŷ_{R+2} of Y_{R+2} is based on the sample {Y_t}_{t=1}^{R+1}. This procedure is repeated until the one-step ahead forecast of Y_T is based on T − 1 observations. The so-called rolling-over, one-step ahead MSFE is
MSFE = \frac{1}{T − R} \sum_{t=1}^{T−R} \{Y_{t+R} − Ŷ_{t+R}\}².   (9.44)
The selected subset lag vector is the one giving the smallest MSFE. Clearly, if a lag selection is carried out for each forecast using, e.g., \widehat{CAFPE}, the above method can be computationally demanding.
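The rolling-over MSFE (9.44) itself is simple to compute. In the R sketch below, forecast1 is any user-supplied one-step forecaster (a hypothetical name); the AR(1) plug-in shown as an example assumes a mean-zero series.

```r
# Rolling-over one-step ahead MSFE (9.44).
rolling_msfe <- function(y, R, forecast1) {
  T <- length(y)
  e2 <- sapply(R:(T - 1), function(m) (y[m + 1] - forecast1(y[1:m]))^2)
  mean(e2)
}
# Example plug-in: a mean-zero AR(1) forecaster via the lag-1 correlation.
ar1_forecast <- function(y) cor(y[-1], y[-length(y)]) * y[length(y)]
```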
Example 9.4: Canadian Lynx Data (Cont’d)
Consider the log10 -transformed Canadian lynx data introduced in Section 7.5
(T = 114). Based on the LL nonparametric estimation method with a Gaus-
sian kernel, we conduct a full search over a wide set of lag combinations with
M = 15. The maximum number of lags entertained in the state vector is set at 4. Both methods, \widehat{AFPE} and \widehat{CAFPE}, select the lag vector (1, 2, 10, 11)
as the optimal one. Comparing this result with the specified lags of the five
fitted models in Table 7.5, it is clear that a subset NLAR with only these four
lags might be sufficient in describing the data. In fact, the residual variance in
both cases is 0.0271 which is considerably lower than the corresponding values
reported in the last column of Table 7.5.
Using the rolling-over, one-step ahead forecasts of the last 12 observations, we
obtain a MSFE of 0.0165, with the pre-set lag vector (1, 2, 10, 11). This MSFE
value remains the same if we apply the CAFPE-based criterion using the initial
estimation sample up to and including time t = 102, and then maintain the
selected lag vector for all remaining periods. If, however, we apply the CAFPE
lag selection criterion for each forecast separately, the overall MSFE is 0.0087.
In this case, the forecasts are based on the selected lag vector (1, 2, 10, 11)
for subsamples of observations up to and including time t = 102, 110, 111, and
112, and on the lag vector (1, 2, 3, 4) for subsamples of observations up to
and including t = 103, . . . , 109.
θ(Y_t) = φ(X_t) + ε_t = \sum_{i=1}^{p} φ_i(Y_{t−i}) + ε_t,   (9.45)
where θ(·) and φi (·) are smooth real-valued, but unknown, functions. For identifi-
ability reasons, we usually require that E[φi (Yt )] = 0.
The objective is to find the optimal transformations θ(·) and φ(·) of Y_t and X_t, respectively, such that the squared-loss regression function

E[\{θ(Y_t) − φ(X_t)\}²] / E[θ²(Y_t)]

is minimized over all smooth real-valued functions θ(·) and φ(·). Clearly, if we fix φ(X_t), the solution for θ(Y_t) is the conditional expectation θ_φ(Y_t) = E[φ(X_t)|Y_t]/E[φ²(X_t)]. If we fix θ(Y_t), then the solution for φ(X_t) is φ_θ(X_t) = E[θ(Y_t)|X_t].
Assume that the joint distribution of the stochastic processes {Y_t} and {ε_t} is known. Then, combining the above steps leads to an iterative procedure for finding the optimal transformations in the sense of minimizing the LS errors, that is,

arg min_{θ,φ} E\big[θ(Y_t) − \sum_{i=1}^{p} φ_i(Y_{t−i})\big]²,   (9.46)

where, to avoid the trivial solution θ(·) ≡ φ_i(·) ≡ 0, we set E[θ²(Y_t)] = 1.
In applications, the conditional expectations in (9.46) are replaced by suitable
estimates obtained from the data. More specifically, within a time series context,
the ACE algorithm works as follows.
(iv) Alternate: Do steps (ii) and (iii) until a convergence criterion is reached.
The resulting functions θ∗ (·), φ∗1 (·), . . . , φ∗p (·) are then taken as estimates of
the corresponding optimal transformations.
For time series data, convergence may be slow due to the correlated nature of
the observations. Also, if {Yt , t ∈ Z} is close to unit root nonstationarity in the
sense that the lag one serial correlation is close to unity, then the ACE algorithm
tends to suggest linear transformations for Yt−1 . Nevertheless, the ACE procedure
will converge to the optimal solution asymptotically, provided the serial dependence
decays sufficiently fast. Besides, the ACE algorithm can handle variables other than continuous predictors, such as categorical (ordered or unordered), integer, and indicator variables.
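In practice, an ACE fit of (9.45) takes a few lines of R via the acepack package (assumed available; it provides ace() and avas()). The simulated AR(2) data and the choice p = 2 are purely illustrative.

```r
# ACE fit of model (9.45) using the acepack package (assumed available).
library(acepack)
set.seed(1)
y <- as.numeric(arima.sim(list(ar = c(0.5, -0.3)), n = 500))
M <- embed(y, 3)                       # columns: y_t, y_{t-1}, y_{t-2}
fit <- ace(x = M[, -1], y = M[, 1])    # theta(.) in fit$ty, phi_i(.) in fit$tx
plot(M[, 2], fit$tx[, 1], xlab = "Y_{t-1}", ylab = "estimated phi_1")
```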
AVAS

AVAS differs from ACE in that θ(·) is selected so that Var\{θ(Y_t) | \sum_i φ_i(Y_{t−i})\} is constant. This modification removes the problem with heteroskedasticity which lies at the root of the ACE difficulties in multiple regression. It is known that if a random variable Z has mean μ and variance V(μ), then the asymptotic variance-stabilizing transformation for Z is h(t) = \int_0^t V(s)^{−1/2} ds. The resulting AVAS algorithm is like Algorithm 9.4, except that in step (iii) it applies the estimated variance-stabilizing transformation to θ̂(·) before standardization.
AVAS can be viewed as a generalization of the Box–Cox ML procedure for choos-
ing power transformations of the response, Yt . It also generalizes the Box–Tidwell
procedure for choosing transformations of the predictor variables, Yt−1 ,Yt−2 , . . . ,Yt−p .
Both ACE and AVAS are useful primarily as exploratory tools for determining which
of the response Yt and the predictors Yt−1 , . . . ,Yt−p are in need of nonlinear trans-
formations and what type of transformation is needed.
Since both the ACE and AVAS algorithms are based on smoothing methods,
prediction of θ(Yt ) based on the conditional mean function may be carried out in a
manner similar to the simple kernel regression case. For example, to predict θ(YT +1 )
as a function of p lagged values of the series, the functions φi (YT +1−i ) (i = 1, . . . , p)
are estimated separately as
φ̂_i(x) = \frac{\sum_{t=1}^{n} K\{(x − Y_t)/h\} Y*_t}{\sum_{t=1}^{n} K\{(x − Y_t)/h\}},   (n = T − i),   (9.47)

where x = Y_{T+1−i} and Y*_t = Y_{t+i}. Then the one-step ahead forecast of θ(Y_{T+1}) is

θ̂(Y_{T+1}) = \sum_{i=1}^{p} φ̂_i(Y_{T+1−i}).   (9.48)
Figure 9.5: Twenty years of daily sea surface temperatures (SSTs) in °C at Granite Canyon, California, measured from March 1, 1971 – April 30, 1991; T = 7,361.
Y_t = β_0 + \sum_{i=1}^{M} β_i φ_i(α′_i X_t) + ε_t,   (9.49)
Figure 9.6: Estimated additive functional relationships between SSTs and transformed
lagged SSTs obtained using ACE.
has the best predictive power, in terms of lowest MSFE. Each φ_i(·) is estimated nonparametrically using a kernel-based smoothing method such that E[φ_i(α′_i X_t)] = 0 and Var[φ_i(α′_i X_t)] = 1.
Model (9.49), with p > 1, is a generalization of the original PPR model introduced by Friedman and Stuetzle (1981), i.e. the series {Y_t}_{t=1}^{T} is modeled as a (smooth, but otherwise unrestricted) function of a (usually) different linear combination of X_t. For the case p = 1, both models have the same form, but the estimation algorithms differ in the sense that the original PPR algorithm chooses α_i (i = 1, ..., M) in a forward stepwise manner. This can result in considerably different model specifications. PPR, as specified by (9.49), is implemented in both R and S-Plus with the constraint \sum_{j=1}^{p} α²_{i,j} = 1.
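The R implementation is stats::ppr. The sketch below fits (9.49) with M = 2 ridge terms to p = 4 lags of a simulated series; the data are invented for illustration (they are not the SST series).

```r
# PPR fit of (9.49) with M = 2 ridge terms via stats::ppr.
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.6), n = 500))
D <- as.data.frame(embed(y, 5))
names(D) <- c("y", paste0("lag", 1:4))
fit <- ppr(y ~ lag1 + lag2 + lag3 + lag4, data = D, nterms = 2)
fit$alpha                        # estimated projection vectors alpha_i
predict(fit, D[nrow(D), ])       # fitted one-step forecast at the last state
```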
Figure 9.7: Estimated functional relationships φ̂_1(α̂′_1 X_t) and φ̂_2(α̂′_2 X_t), plotted against α̂′_1 X_t and α̂′_2 X_t, between SSTs and lagged SSTs obtained using PPR.
Table 9.2: The estimated coefficients in the PPR model for the SSTs.
Most of the weight in the first projection vector falls on Yt−1 and the estimated
relationship is approximately linear. The second projection vector has weights
on all lagged values of Yt and the graph suggests that this projection is related
to Yt in a nonlinear fashion. The multiple R2 value is 89.05%, similar to that
for the ACE fitted model. A third-order projection makes little additional
contribution to the prediction of SSTs.
Estimation
Although nonparametric methods do not require an explicit model, the TSMARS
methodology is probably best understood through introducing the following setup.
Let {Y_t, t ∈ ℤ} be a univariate stationary time series process that depends on p_1 (p_1 ≥ 0) past values of Y_t and on q p_i-dimensional vectors of exogenous time series variables X_{i,t} = (X_{i,t−1}, ..., X_{i,t−p_i})′ (p_i ≥ 0; i = 1, ..., q). Assume that there are T observations on {Y_t} and {X_{i,t}}, and that the data are presumed to be described
The goal is to construct a function μ (·) that can serve as a reasonable approximation
of μ(·) over the domain D.
We introduce the (TS)MARS methodology by first discussing the method of
recursive partitioning. Let {R^{(s)}}_{s=1}^{S} be a set of S disjoint subregions representing a partitioning of D. Given these subregions, recursive partitioning approximates the unknown function μ(·) at W_t = (1, Y′_{t−1}, X′_{1,t}, ..., X′_{q,t})′ in terms of basis functions B_s(·), so that

μ̂(W_t) = β_0 + \sum_{s=1}^{S} β_s B_s(W_t),   (9.51)
Figure 9.8: Pair of one-dimensional basis functions used by the MARS method; (x−0.5)+
(left panel) and (0.5 − x)+ (right panel).
Here, β_0 is the coefficient of the constant basis function B_0(W_t) = 1, and the sum is over all remaining basis functions produced by the forward step that survive the backwards deletion step; u_{ks} = ±1 indicates the (left/right) sense of the associated step function. The quantity K_s is the number of factors or splits that give rise to the sth basis function B_s(·). The subscript v(k, s), t (v = 1, ..., n) labels the predictor variables at time t (t = 1, ..., T), and the t*_{k,s} represent values of the corresponding variables.
Model selection
To evaluate the GOF and compare partition points, (TS)MARS uses residual squared errors in the forward step. In the backward step, it uses a modified generalized CV (GCV) criterion that requires only one evaluation of the model and hence reduces some of the computational burden of (TS)MARS. That is,

GCV(M) = σ̂²_ε \Big/ \Big(1 − \frac{C(M)}{T}\Big)²,   (9.53)

where σ̂²_ε = T^{−1} \sum_{t=1}^{T} \{Y_t − μ̂_M(W_t)\}² is an estimate of σ²_ε, measuring the lack-of-fit to the training data. The term in the denominator of (9.53) penalizes over-parameterization, with
Figure 9.9: (a) Five years of daily SSTs (◦ C); (b) Wind speed (in knots) at Granite
Canyon; T = 1,825.
On the other hand, the SETAR model allows for different error variances in different
regimes, whereas homogeneity of error variances is assumed in TSMARS.
Forecasting
Forecasts for TSMARS models that involve no stochastic exogenous covariates may
be obtained in two ways – iteratively or directly. Given Yt+1−j , (j = 1, . . . , p), an
iterated forecast of Y_{t+H} (H ≥ 1) is computed as

Ŷ_{t+H|t} = μ̂(Ŷ_{t+H−1|t}, ..., Ŷ_{t+H−p|t}),   (9.54)

where Ŷ_{t+H−j|t} = Y_{t+H−j} when H − j ≤ 0, beginning with Ŷ_{t+1|t}. This is analogous
to the iterative prediction of a parametric AR model, as μ (·) can be considered as
a parametric spline function. Direct forecasts of Yt+H can be obtained by fitting
a TSMARS model using only values of the series at lags greater than or equal to
H as predictors, e.g., Yt+H|t = μ(Yt+H−1 , . . . , Yt+H−p ). This is analogous to the
methods of forecasting for kernel-based regression models. Using the direct method,
a different model should be estimated for each value of H to be forecast, as the
TSMARS model is selected to minimize a function of the forecast errors.
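The direct approach is easy to mimic in R with the earth package, used here as an assumed stand-in for the original TSMARS code: regress Y_t on lags H, ..., H + p − 1 only. The AR(1) data, H = 2, and p = 4 are illustrative choices.

```r
# Direct H-step forecasting in the TSMARS spirit via the earth package.
library(earth)
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.6), n = 500))
H <- 2; p <- 4
Z <- embed(y, p + H)                         # y_t and its first p+H-1 lags
Y <- Z[, 1]
X <- Z[, (H + 1):(H + p)]                    # predictors at lags >= H
colnames(X) <- paste0("lag", H:(H + p - 1))
fit <- earth(X, Y)                           # forward/backward steps, GCV
newx <- matrix(y[length(y) - (0:(p - 1))], nrow = 1,
               dimnames = list(NULL, colnames(X)))
predict(fit, newx)                           # direct forecast of Y_{T+H}
```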
inhomogeneity in the series. Also, they remove a one-year cycle from the data
before model fitting, i.e. Yt = log SSTt −{b0 +b1 sin(2πt/365)+b2 cos(2πt/365)}
with LS estimates b0 = 2.4826, b1 = −0.0907, and b2 = 0.0460. Further, they
recode the WDt series as a categorical variable representing the following four
wind directions: 1 = East; 2 = North; 3 = West; and 4 = South. Days with
no wind or only light airs receive a code of 5. The resulting fitted model is:
Ŷ_t = 2.19_{(0.00)} + 0.88_{(0.01)} (Y_{t−1} − 2.13)_+ + 1.62_{(0.28)} (2.22 − Y_{t−34})_+
      + 0.01_{(0.00)} (WS_{t−1} − 1.10)_+ I(WD_{t−1} ∈ {1, 2}) − 0.04_{(0.00)} (WS_{t−1} − 1.10)_+ I(WD_{t−1} ∈ {2, 3})
9.2.4 Boosting
Boosting is a semiparametric forward stagewise algorithm that, in a time series
context, iteratively estimates a multivariate nonlinear additive AR model, with or
without exogenous variables. Let {Yt , t ∈ Z} be a univariate stationary time series
process which depends on the (q+1)p-dimensional vector W_t = (W_{1,t}, ..., W_{(q+1)p,t})′ = (Y′_{t−1}, X′_{1,t}, ..., X′_{q,t})′ ∈ ℝ^{(q+1)p}, where Y_{t−1} = (Y_{t−1}, ..., Y_{t−p})′ is the p-dimensional vector of lagged values, and X_{i,t} = (X_{i,t−1}, ..., X_{i,t−p})′ (i = 1, ..., q) are the q p-dimensional vectors of explanatory variables. Similar to (TS)MARS, the goal
dimensional vectors of explanatory variables. Similar as with (TS)MARS, the goal
is to obtain an estimate, or approximation, μ (·) of the regression function μ(·) ≡
E(Yt |Wt = w). For a sample of T observations, this approximation comes down to
minimizing the expected value of a loss function, say L(·), over all values {Yt , Wt }Tt=1 .
A common procedure that solves the above problem, and facilitates interpreta-
tion, is to restrict μ(·) to be a member of a parameterized class of functions μ(·; β).
To be specific, we reformulate the original function optimization problem as a parameter optimization problem, i.e.

μ̂(W_t) ≡ μ(W_t; β̂),   (9.56)

where

β̂ = arg min_β \sum_{t=1}^{T} L\big(Y_t, μ(W_t; β)\big),   (9.57)

with L(·,·) a loss function which is assumed to be differentiable and convex with respect to the second argument. Two frequently used loss functions are the L2 loss and the absolute error or L1 loss. The final solution is given by

μ(W_t; β̂^{[M]}) = \sum_{m=0}^{M} ν h(W_t; γ̂^{[m]}).   (9.58)
Here h(·), termed a weak learner or base learner, is characterized by the mth estimate γ̂^{[m]} of an M-dimensional parameter vector γ; ν ∈ (0, 1) is a shrinkage parameter; and γ̂^{[0]} is an initial guess of γ. Thus, the underlying structure in the parameters is assumed to be of an "additive" form

β̂^{[M]} = \sum_{m=0}^{M} ν γ̂^{[m]}.
The shrinkage parameter ν can be regarded as controlling the learning rate of the boosting procedure. It ensures that the base learner is "weak" enough, i.e. the base learner has large bias, but low variance.
Now, solving (9.58) directly is infeasible. One practical way to proceed is to
use greedy (stepwise) optimization to estimate the additive terms one at a time.
Jointly with a steepest-descent step, the resulting (generic) algorithm, called gradient
descent boosting, can be summarized as follows.
(iii) Perform a simple regression of the weak learner on the negative gradient vector, i.e. γ̂^{[m]} = arg min_γ \sum_{t=1}^{T} \{g^{[m]}(W_t) − h(W_t; γ)\}².

(v) Iterate steps (ii) – (iv) until m = M, where M may be chosen by GCV, as in (TS)MARS, or by an AIC-type stopping criterion, e.g., AIC_c.
High-dimensional models
For regression problems with a large number of predictor variables, Bühlmann (2006) proposes componentwise boosting, where the base learner is applied to one variable at a time. The simplest weak learner is linear. For this learner γ̂^{[m]} = (0, ..., 0, γ̂_{s_m}, 0, ..., 0)′ ∈ ℝ^{(q+1)p}, where s_m ∈ {1, 2, ..., (q+1)p} denotes the respective component at the mth boosting iteration. Then, for componentwise L2 boosting, the modification of h(·) in Algorithm 9.5 is as follows:

h(W_t; γ̂^{[m]}) = W′_t γ̂^{[m]},   γ̂_j = OLS\{γ_j\},   ∀j ∈ J ≡ {1, 2, ..., (q+1)p},   (9.59)

s_m = arg min_{j∈J} \sum_{t=1}^{T} \{g^{[m]}(W_t) − h(W_t, γ̂^{[j]})\}²,   (9.60)

where OLS{γ_j} denotes the ordinary LS estimator of γ_j with the negative gradient of the loss function as a T-dimensional pseudo-response vector. The resulting algorithm is called generalized linear model boosting (glmboost).
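To see what (9.59) – (9.60) amount to computationally, here is a plain-loop R sketch of componentwise L2 boosting with a linear base learner; it is a bare-bones stand-in for glmboost, with predictors assumed centered and ν the shrinkage parameter.

```r
# Componentwise L2 boosting with a linear base learner (glmboost-type).
l2boost <- function(X, y, M = 100, nu = 0.1) {
  beta <- numeric(ncol(X))
  f <- rep(mean(y), length(y))               # initial guess: the mean
  for (m in 1:M) {
    g  <- y - f                              # negative gradient (L2 loss)
    b  <- colSums(X * g) / colSums(X^2)      # per-component OLS, cf. (9.59)
    ss <- colSums((g - t(t(X) * b))^2)       # RSS of each candidate fit
    j  <- which.min(ss)                      # selected component s_m (9.60)
    beta[j] <- beta[j] + nu * b[j]
    f <- f + nu * X[, j] * b[j]
  }
  list(intercept = mean(y), beta = beta)
}
```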
Table 9.3: Comparison of MSFEs for H = 1, 4, 8, and 12-steps ahead predictions made
with glmboost, gamboost, BRUTO, and MARS for the quarterly U.S. unemployment rate.
For each H, blue-typed numbers indicate the lowest MSFE.
Figure 9.10: Boxplots of the averaged squared forecast errors, based on 21 forecasts, for
H = 1, 4, 8, and 12-steps ahead forecasts of the quarterly U.S. unemployment rate.
Table 9.4: Selected lags for the first quarter of 2007 when the available information set
reaches its maximum; Quarterly U.S. unemployment rate.
where {ε_t} ~ i.i.d. (0, σ²_ε), with ε_t independent of Y_{t−i} ∀i > 0. Model (9.61) is a special
case of the state-dependent model (2.10), and hence has all the nice properties of an SDM. The model encompasses the SETAR and STAR models. A direct generalization follows from introducing functional-coefficient MA terms (Wang, 2008). If d > p, a coefficient term φ_0(Y_{t−d}) may be included in the model. For d = p, such a term creates ambiguity and is generally omitted. Clearly, the FCAR(p) model forces each function of Y_{t−i} (i = 1, ..., p) to be of the form φ_i(Y_{t−d})Y_{t−i}, whereas the more general NLAR model allows φ_i(·) to vary freely.
The functional forms of the coefficients can simply be estimated at time t using an arranged local regression with a fixed-length moving window and a minimum data size. The resulting estimates φ̂_i(·) of φ_i(·) are consistent under geometric ergodicity conditions (Chen and Tsay, 1993b). By plotting φ̂_i(·) versus the threshold variables Y_{t−i} (i = 1, ..., p), one may infer good candidates for the functional form.
Generalized FCAR
Cai et al. (2000b) propose a generalized FCAR(p) model, given by
where X ∈ Rq can consist of possibly more than one lagged value of the time series
process {Yt , t ∈ Z} or some other exogenous variable. In addition, the Zi,t (i =
1, . . . , p) can be lagged values of {Yt , t ∈ Z} or can be a different exogenous variable,
although commonly Zi,t = Yt−i is used. The φi (·) are assumed to have a continuous
second derivative. The functional form can be estimated nonparametrically using
kernel-based methods. In this sense, analysis of (9.62) may be thought of as a
hybrid of parametric and nonparametric methods. In the following, we discuss the
LL smoother for the case q = 1.
Let {Y_t, X_t, Z_t = (Z_{1,t}, ..., Z_{p,t})′}_{t=1}^{T} denote the process observations. We approximate φ_i(·) locally at a point x_0 ∈ ℝ as φ_i(x) ≈ a_i + b_i(x − x_0). Then (a_i, b_i) are estimated by minimizing the weighted sum of squares

\sum_{t=1}^{T} W_t \Big[ Y_t − \sum_{i=1}^{p} \{a_i + b_i(x_0 − X_t)\} Z_{i,t} \Big]²,   (9.63)
where n denotes the length of the sth subseries of {Y_t} (s = 1, ..., S), and the φ̂_{i,s}(·) are computed from the series up to observation T − sn, using bandwidth h_T [T/(T − sn)]^{1/5}. The optimal bandwidth is defined to minimize
MSFE(h_T) = \sum_{s=1}^{S} MSFE_s(h_T).   (9.65)
This measure can be regarded as a modified form of multifold CV, appropriate for
stationary time series processes. In practical applications, it is recommended to set
n = [0.1T ] and S = 4. The same criterion can be used to select among different X
and different model orders, p.
Model assessment
Cai et al. (2000b) propose a bootstrap LR-type test for FCAR models to determine
whether the coefficient functions are constant or take a particular parametric form.
Suppose, for some parameter vector θ ∈ Θ, where Θ denotes the space of allowed
values of θ, we have the null hypothesis
(ii) Estimate the FCAR model nonparametrically and construct the residuals ε̂_t = Y_t − \sum_{i=1}^{p} φ̂_i(X_t) Z_{i,t}, and the residual sum of squares RSS_1 = \sum_{t=1}^{T} ε̂_t².
(iv) Generate the bootstrap residuals {ε*_t} from the EDF of the centered residuals {ε̂_t − ε̄} from the nonparametric FCAR fit, and construct bootstrap process values as Y*_t = \sum_{i=1}^{p} φ_i(X_t; θ̂) Z_{i,t} + ε*_t. (Note that if X_t or Z_{i,t} are functions of the original {Y_t, t ∈ ℤ} process, the original values are used, not values obtained from the bootstrapped process. This corresponds to a fixed-design nonparametric regression method.)
(v) Compute the test statistic LR*_T based on the bootstrapped sample in the same way as (9.66).

(vi) Repeat step (v) B times, to obtain {LR*_T^{(b)}}_{b=1}^{B}. Then the bootstrap p-value is given by

p̂ = \frac{1 + \sum_{b=1}^{B} I(LR*_T^{(b)} ≥ LR_T)}{1 + B}.
Note that the above test statistic can be used to test for constant coefficients by letting φ_i(x; θ) = φ_i. The residuals are bootstrapped from the nonparametric fit to ensure that the estimated residuals are consistent, no matter whether the null hypothesis or the alternative hypothesis is correct.
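Step (vi) reduces to one line of R; LR_obs and LR_boot (names ours) stand for the observed statistic and the vector of bootstrap statistics.

```r
# Bootstrap p-value of step (vi), with the customary +1 correction.
boot_pvalue <- function(LR_obs, LR_boot) {
  (1 + sum(LR_boot >= LR_obs)) / (1 + length(LR_boot))
}
```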
Figure 9.11: (a) Plot of the MSFE versus h_T for estimation of model (9.67); (b) – (f) Estimated functional coefficients φ̂_i(·) in model (9.67) for the quarterly U.S. unemployment rate.
figures indicate that most functions φ̂_i(·) are either quadratic or sine functions. Finally, we apply the bootstrap LR-type test statistic of Algorithm 9.6 (500 replicates) to test (6.15) against the FCAR model in (9.67). The p-value is 0.00, which reinforces that the three-regime SETAR model is adequate, with a residual variance σ̂²_ε = 0.26 × 10^{−2} versus σ̃²_ε = 0.43 × 10^{−2} for the fitted FCAR model.
where {ε_t} ~ i.i.d. (0, σ²_ε), with ε_t independent of X and Y_{t−i} ∀i > 0. Here, φ_i(·) (i =
matrix, are used for estimation. Xia et al. (1999) suggest selecting θ and h_T using a leave-one-out CV method, as follows.
Algorithm 9.7: Estimating θ and h_T for the single-index model

(i) For a range of θ and h_T values, compute

Ŝ(θ, h_T) = \sum_{X_t ∈ A} \{ Y_t − Φ̂_{θ,t}(g(X_t; θ))′ Ỹ_t \}²,   (9.69)

where Φ̂_{θ,t}(·) denotes the LL regression estimate of Φ_θ(x) = (φ_{0,θ}(x), ..., φ_{p,θ}(x))′ obtained using kernel regression when the point (Y_t, X_t) is omitted from the data.
σ̂²_ε = \frac{Ŝ(θ̂, ĥ_T)}{\sum_{t=1}^{T} I\{X_t ∈ A\}},

where (θ̂, ĥ_T) is the pair of solutions.
Xia et al. (1999) prove the asymptotic normality of the estimator of θ and the consistency of the estimators of φ_i(·) under some regularity conditions. They also show that the estimated bandwidth ĥ_T is asymptotically efficient and is proportional to T^{−1/5}.
where {ε_t}, {X_t} ~ i.i.d. N(0, 1), and {ε_t} and {X_t} are mutually independent.
Table 9.5: Sample mean and standard deviation (in parentheses) of the estimated θ, β, and σ²_ε for different sample sizes T; based on 1,000 MC replications.

T     θ̂_1               θ̂_2               β̂_1               β̂_2                σ̂²_ε
50    0.7978 (0.0331)   0.5994 (0.0569)   0.4427 (0.0447)   −0.5875 (0.0593)   0.0261 (0.0189)
100   0.7987 (0.0170)   0.6010 (0.0226)   0.4484 (0.0212)   −0.5959 (0.0219)   0.0194 (0.0158)
200   0.7996 (0.0091)   0.6003 (0.0122)   0.4492 (0.0112)   −0.5983 (0.0112)   0.0169 (0.0072)
Figure 9.12: Simulation result from a typical data set of size T = 200. The blue solid line denotes the estimated nonlinear relation between Y_t and θ̂′X_t. Black dots denote Y_t − β̂′X_t plotted against θ̂′X_t. The red solid line denotes the real nonlinear part of relation (9.70).
confirms the theoretical results: stable estimates of θ, β, and σ²_ε are obtained even for T = 50. Figure 9.12 shows the estimated nonlinear relation between Y_t and θ̂′X_t from a typical simulated data set of size T = 200. We see that the estimated function (blue solid line) is relatively close to the real one (red solid line).
Table 9.6: Some applications of semi- and nonparametric methods to univariate time series.

Section   Method(s)          Reference(s)                    Time series
9.1.1     Mean, Mdn, Mode    De Gooijer and Zerom (2000)     U.S. weekly T-bill rate
9.1.4 k-NN Lall and Sharma (1996) Monthly streamflow data
Rajagopalan and Lall (1999) Daily weather data
Loess Barkoulas et al. (1997) U.S. quarterly T-bill rate
9.2.1 ACE, BRUTO Chen and Tsay (1993a) Daily river flow data
BRUTO Shafik and Tutz (2009) Monthly unemployment index
9.2.2 PPR Xia and An (1999) Australian blowfly data
Lin and Pourahmadi (1998) Canadian lynx data
9.2.3 TSMARS Lewis and Stevens (1991) Annual sunspot numbers
Lewis and Ray (1997) Daily sea surface temperatures
Chen et al. (1997) Eight environmental time series
De Gooijer et al. (1998) Weekly exchange rates
9.2.4 Glmboost/Gamboost Robinzonov et al. (2012) German monthly industrial production
Glmboost Buchen and Wohlrabe (2011) U.S. monthly industrial production
density. The bandwidth selection rule optimizes the estimated conditional density by minimizing the ISE. Fan et al. (1996) provide a similar, but ad hoc, method. McKeague and Zhang (1994) study cumulative versions of one-step lagged conditional mean and variance functions.
Section 9.1.6: Early studies on CV nonparametric lag selection consider functional re-
lationships with conditional homoskedasticity; see, e.g., Cheng and Tong (1992), Yao and
Tong (1994), and Vieu (1994, 1995). Guo and Shintani (2011) investigate the properties
of the FPE lag selection procedure for nonlinear additive AR models. Also, there is an
extensive literature on CV methods for the simultaneous selection of the parametric and
nonparametric components in a partially linear model; see, e.g., Gao and Tong (2004), and
Avramidis (2005) and the references therein. Chen et al. (1995) propose three procedures
for testing additivity in nonlinear ARs of the form (9.45).
Section 9.2.1: The ACE and AVAS algorithms were originally introduced for regression
modeling by Breiman and Friedman (1985); see also Hastie and Tibshirani (1990) and
Tibshirani (1988).
Section 9.2.2: Following Hall (1989), a kernel-based PPR estimation method for time series
has been proposed by Xia and An (1999), and applied to real data. Granger and Teräsvirta
(1992b) report results of a small experiment in which linear models, PPR models, and models
containing both linear and PPR terms are fitted to nonlinear time series under a variety of signal-to-noise settings. They conclude that when nonlinearity is strong, PPR models fit and forecast quite well, but tend to overfit the data when nonlinearity is weak.
Section 9.2.3: Lewis and Ray (2002) use TSMARS to model nonlinear threshold-type AR
behavior in periodically correlated time series. A Bayesian nonparametric implementation
of nonlinear AR model fitting using splines has been discussed by Wong and Kohn (1996).
A Bayesian implementation of MARS, with application to time series prediction, has been
given by Denison et al. (1998). In both cases, Bayesian estimation is carried out by MCMC methods. These methods generate an enormous number of combinations of basis functions, from which it is difficult to extract information on the regression structure. Sakamoto (2007) solves this
problem by proposing an empirical Bayes method to select basis functions and the position
of the knots. Porcher and Thomas (2003) propose a penalized least squares approach to
order determination in TSMARS.
Section 9.2.4: Robinzonov et al. (2012) perform a nonlinear time series Monte Carlo
comparison of glmboost, gamboost, TSMARS, BRUTO, and an algorithm due to Huang
and Yang (2004). These latter authors use a stepwise procedure for the identification of
nonlinear additive AR models based on spline estimation and BIC. Robinzonov et al. (2012)
conclude that boosting is superior to its rivals in discovering the true nonlinear DGP. From
a computational point of view, Schmid and Hothorn (2008) advocate the use of componentwise P-spline base learners, with the shrinkage parameter vector estimated via penalized least squares; see also Shafik and Tutz (2009) for the corresponding boosting algorithm. Some
ideas to address the multivariate generalization of boosting are provided by Lutz et al.
(2008). Assaad et al. (2008) adopt the boosting algorithm for predicting future time series
values using recurrent NNs as base learners. For an overview on boosting in general, we
refer to Bühlmann and Hothorn (2007).
Section 9.2.5: Chen and Liu (2001) place the estimation of (9.61) in the smoothing con-
text, proposing an LL regression estimate of φi (·) (i = 1, . . . , p). In addition, these authors
give two test statistics. One for assessing whether all the coefficient functions are con-
stant. The second one tests if all the coefficient functions are continuous. A small MC
simulation study complements the paper. Chen and Wang (2011) investigate some prob-
abilistic properties (stationarity and invertibility) of combined AR–FCMA models. Chen
and Huo (2009) provide an approach that generalizes smoothing splines to high dimensions
(> 3 covariates) and is relatively free from formulational assumptions such as the restric-
ted number of covariates in the FCAR models; MATLAB and R codes are available at
http://www.tandfonline.com/doi/suppl/10.1198/jcgs.2009.08040?scroll=top.
Matsuda (1998) proposes an alternative GOF test statistic to determine whether the coeffi-
cient functions are constant or take a particular parametric form. Although the test statistic
has asymptotically a χ2 distribution under certain regularity conditions, he finds that a boot-
strap method provides better significance levels in practice. Cai et al. (2000a) provide details
of estimating varying-coefficient models in a regression setting. Cai et al. (2009) consider
the estimation of a generalized functional coefficient regression model with nonstationary
covariates.
Section 9.2.6: Wu et al. (2011) recommend estimating the univariate varying-coefficient functions in the single-index model by P-splines. This approach provides an explicit fit
which allows the authors to conduct multi-step ahead out-of-sample forecasting. The paper
includes implementation details of the proposed estimation algorithm. Wu et al. (2010)
introduce LL estimation for quantile regression via single-index models as well as some
computational algorithms.
Software References
Section 9.1: A kernel smoothing MATLAB toolbox is available at http://nl.mathworks.
com/matlabcentral/linkexchange/links/3551-kernel-smoothing-toolbox as a part
of the book by Horová et al. (2012). The toolbox contains menu-driven functions for the
estimation of: univariate densities, distribution functions, quality indices, hazard functions,
regression functions, and multivariate densities. Various alternative software codes can be
downloaded from MATLAB Central. For instance, ksr (Gaussian kernel smoothing regres-
sion), ksrlin (local linear Gaussian kernel regression), and smoothing (Nadaraya–Watson
smoothing with GCV). Also, several R packages for kernel smoothing are available. For in-
stance, ksmooth {stats} (NW estimator (local constant fit), univariate x only, no automatic
bandwidth selection), and sm (nonparametric smoothing methods described in Bowman and
Azzalini (1997)).
KDE is a general MATLAB class for k-dimensional kernel density estimation (written in
a mix of “m” files and MEX/C++ code); see http://www.ics.uci.edu/~ihler/code/
kde.html. There are various R-packages available. For instance, sskernel (kernel density es-
timation with an automatic bandwidth selection), gkde (Gaussian kernel density estimation
with bounded support), kerdiest (kernel estimators of the distribution function and related
functionals, with several CV bandwidth methods), KernSmooth (local linear or quadratic
kernel smoothing; up to bivariate density estimation with restricted bandwidths; see Wand
and Jones (1995)), and ks (kernel smoothing; kernel density estimation; kernel discriminant
analysis; two- to six-dimensional data; general bandwidths).
An extensive set of semi- and nonparametric methods comes with the interactive commercial
statistical computing environment XploRe. Using this software, it is easy to reproduce many
of the examples in the book by Härdle (1990). XploRe is not sold anymore. However, the last
version, 4.8, can be freely downloaded from the website http://sfb649.wiwi.hu-berlin.
de/fedc_homepage/xplore.php.
MATLAB code (mean median.m) for obtaining the conditional mean and the conditional
median forecasts, using single- and multi-stage methods, can be downloaded from the website
of this book. The solutions manual (Exercise 9.4) contains MATLAB code for computing
the conditional mean, median, and mode.
Section 9.1.4: The R-packages knn, class, and FNN (fast nearest neighbor) contain k-
NN implementations. A related package is knnflex; see http://cran.r-project.org/
src/contrib/Archive/knnflex/. The R-kknn package performs weighted k-NN. The R-
KODAMA (KnOwledge Discovery by Accuracy MAximization) package contains the function
KNN.CV which performs a 10-fold CV bandwidth selection on a given data set using k-NN.
The MATLAB function knn.m is available at MATLAB Central. Related MATLAB func-
tions are kNearestNeighbors, knnsearch, and knnclassify. Alternatively, a MATLAB package
for obtaining one-step ahead k-NN forecasts is available at https://sites.google.com/
site/marceloperlin/.
The working paper “Computing nonparametric functional estimates in semiparametric prob-
lems” by Miguel A. Delgado (http://orff.uc3m.es/bitstream/handle/10016/5821/
we9217.PDF) offers a set of FORTRAN77 routines including k-NN, kernel regression with
symmetric and possibly non-symmetric kernels, and nonparametric k-NN regression.
The Loess/Lowess methodology of Cleveland (1979) is implemented in the R (S-Plus) func-
tions lowess and loess including their iterative robust versions. The loess function (local
linear or quadratic fits, multivariate x’s, no automatic bandwidth selection) is more flexible
and powerful. For large sample sizes, however, the computations can be time-consuming.
Cleveland et al. (1990) develop a seasonal adjustment algorithm based on robust loess. It is
implemented in the R (S-Plus) function stl. The curve fitting toolbox in MATLAB contains
the function smooth with the loess/lowess methods and their robust variants.
Section 9.1.5: Ox code (Test-Algorithm-93.ox) for obtaining the Markov forecast densities,
as summarized in Algorithm 9.3, is available at the website of this book.
Section 9.1.6: The lag selection methods AFPE, CAFPE, and MSFE are options within
the freely available computer package JMulTi, a JAVA application designed for the specific
needs of time series econometrics. The package can be downloaded from http://www.
jmulti.de/download.html.
Section 9.2.1: ACE and AVAS are implemented in the R-acepack package. S-Plus has
an implementation of both algorithms too, called ace and avas, respectively. The FOR-
TRAN77 source codes of Friedman’s ACE algorithm, and PPR are available from http://www-stat.stanford.edu/~jhf/ftp/progs/. A FORTRAN90 version of the ACE al-
gorithm (mace.f90) can be downloaded from Alan Miller’s FORTRAN software webpage at
http://jblevins.org/mirror/amiller/. The MATLAB–ACE algorithm, using adaptive
partitioning to calculate the conditional expectations, and the supersmoother algorithm are
available from the MATLAB archive. The function areg in the R-Hmisc package offers the
option to control the smoothness of the transformation in ACE.
Section 9.2.2: PPR is implemented in the R-stats package as the function ppr, and within
S-Plus it is called ppreg. Both functions are based on the so-called smooth multiple additive
regression technique (SMART) of Friedman (1984). As explained in Section 9.2.2, SMART
modeling is a generalization of PPR (Friedman and Stuetzle, 1981).
Section 9.2.3: MARS and BRUTO are provided in the R-mda package. A new, slightly more flexible altern-
ative implementation of MARS (fast MARS) is in the R-earth package. A commercial version
of MARS is available from http://www.salford-systems.com/products/mars. ARESLab
is an Adaptive Regression Splines toolbox for MATLAB/Octave, which can be downloaded
from Gints Jekabsons’ webpage at http://www.cs.rtu.lv/jekabsons/regression.html.
Section 9.2.4: There are several implementations of boosting techniques available as add-ons for R. Both procedures glmboost and gamboost are contained in the packages mboost
and GAMBoost. The first package provides an implementation for fitting GLMs, as well
as additive gradient-based boosting. GAMBoost contains an implementation of likelihood
boosting as proposed by Tutz and Binder (2006).
Section 9.2.5: The results in Example 9.9 were obtained with the S-Plus code to accom-
pany the book by Fan and Yao (2003); see http://orfe.princeton.edu/ ~jqfan/fan/
nls.html.
Section 9.2.6: The simulation results in Example 9.10 were obtained using SAS code,³ provided by Yingcun Xia. The epls.sas code is available at the website of this book.

³ SAS is a registered trademark of SAS Institute, Inc.
Exercises
Empirical and Simulation Questions
9.1 Consider the NLAR(1) process

$$Y_t = \sin(Y_{t-1}) + \varepsilon_t, \qquad \{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0,1).$$
(a) The file Yt-n500-sinus.dat contains T = 500 simulated data points from the above process. Compute the NW local constant smoother $\widehat{\mu}^{NW}_h(x)$ with $x = Y_{t-1}$ equally spaced in the range [−2, 2], h ≡ h_T = 0.02, and with the Epanechnikov kernel (see Table 7.7). If h → 0, what happens with $\widehat{\mu}^{NW}_h(\cdot)$?
(b) Repeat part (a) using the local linear smoother $\widehat{\mu}^{LL}_h(\cdot)$, with a Gaussian kernel.
(c) Plot both kernel regression estimates jointly with the true regression function,
and the generated data. Comment on the results.
(d) Repeat part (a) using the plug-in bandwidth $h_{rot} = \widehat{\sigma}_Y T^{-1/5}$. Compare all kernel regression estimates. Is there any observable difference? Why?
[Hint: Use the MATLAB-ksrgress function or a similar interactive package.]
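For orientation, here is a self-contained R sketch of part (a), simulating the process instead of reading Yt-n500-sinus.dat. Note that with h = 0.02 most local windows contain no observations, which is precisely the point of the question; a larger bandwidth is used for display:

## NW local constant smoother with an Epanechnikov kernel.
set.seed(123)
T <- 500; y <- numeric(T)
for (t in 2:T) y[t] <- sin(y[t - 1]) + rnorm(1)
epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
nw <- function(x, h) {
  w <- epan((x - y[-T]) / h)      # kernel weights on the lagged values
  sum(w * y[-1]) / sum(w)         # local constant fit of Y_t on Y_{t-1}
}
grid <- seq(-2, 2, length = 101)
mu.hat <- sapply(grid, nw, h = 0.2)
plot(grid, mu.hat, type = "l"); lines(grid, sin(grid), lty = 2)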
9.2 A simple algorithm (Jaditz and Sayers, 1998) for NN estimation of $\mu(x) = E(Y_{t+1}|X_t = x)$ goes as follows. For a given lag length p, let $\{(Y_t, X_t)\}_{t=1}^{T}$ be the set of available observations, where $X_t = (Y_t, Y_{t+1}, \ldots, Y_{t+p-1})'$. Divide the data into a prediction set $\mathcal{P} = \{(Y_t, X_t): N_f < t \le T\}$ and, for some $N_f < T$, a fitting (training) set $\mathcal{F}_t$. For each $Y_t \in \mathcal{P}$ calculate the distance between $X_t = x$ and $X_i$ for all $i \in \mathcal{F}_t$ using the supremum norm. Sort the data according to the distance. Then, for a given number of NNs, select the $k_n$ (n = T − p) nearest pairs to estimate the parameters $\alpha_{0,k_n}$ and $\boldsymbol{\alpha}_{p,k_n} = (\alpha_{1,k_n}, \ldots, \alpha_{p,k_n})'$ in the local linear regression model $Y_{(i)} = \alpha_{0,k_n} + X'_{(i)}\boldsymbol{\alpha}_{p,k_n} + \varepsilon_{(i),k_n}$, with $\{\varepsilon_{(i),k_n}\}$ a zero-mean WN process. Next, use the estimated parameters $\widehat{\alpha}_{0,k_n}$ and $\widehat{\boldsymbol{\alpha}}_{p,k_n}$ to calculate the one-step ahead forecast $\widehat{Y}_{(i)+1|(i)} = \widehat{\alpha}_{0,k_n} + X'_{(i)}\widehat{\boldsymbol{\alpha}}_{p,k_n}$, and the associated one-step ahead forecast error $e_{(i)+1|(i)} = Y_{(i)+1} - \widehat{Y}_{(i)+1|(i)}$. Pick the value of $k_n$ that minimizes the MSFE. Finally, given the specified number of NNs, say $k_n^*$, rebuild the data set to replicate the regression. Then, in the present setting, the k-NN estimator of μ(x) is defined as

$$\widehat{\mu}_{k\text{-NN}}(x) = \frac{1}{k_n^*}\sum_{x_{(i)} \in \mathcal{F}_t,\, i \in N(x)} Y_{(i)+1};$$

see (9.33) for a more general case.
(a) Using your favorite programming language, write a computer code to obtain H one-step ahead forecasts for the above k-NN regression algorithm. Include a “robust” matrix inversion routine as a provision for near-singular matrices $X'_{(i)}X_{(i)}$.
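A compact R sketch of the algorithm, for p = 2 and a single forecast origin, using standard lag-vector indexing; a small ridge term serves as the “robust” guard against a near-singular X′X:

## One-step ahead k-NN forecast with a local linear fit on the neighbours.
knn.forecast <- function(y, p = 2, kn = 20, ridge = 1e-8) {
  X <- embed(y, p + 1)                   # rows: (Y_t, Y_{t-1}, ..., Y_{t-p})
  Xlag <- X[, -1, drop = FALSE]          # lag vectors
  Ynext <- X[, 1]                        # associated responses
  x0 <- rev(tail(y, p))                  # state at the forecast origin
  d <- apply(Xlag, 1, function(r) max(abs(r - x0)))   # sup-norm distances
  idx <- order(d)[1:kn]                  # the kn nearest neighbours
  Z <- cbind(1, Xlag[idx, , drop = FALSE])
  beta <- solve(t(Z) %*% Z + ridge * diag(ncol(Z)), t(Z) %*% Ynext[idx])
  as.numeric(c(1, x0) %*% beta)          # local linear one-step forecast
}
set.seed(1); y <- numeric(300)
for (t in 2:300) y[t] <- sin(y[t - 1]) + 0.5 * rnorm(1)
knn.forecast(y, p = 2, kn = 25)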
The Great Salt Lake (GSL) of Utah is the fourth largest perennial, closed-basin, saline lake in the world. Monthly measurements of the volume (in m³) in the north arm of the lake from October 1949 to December 2012 (756 observations) are given in the file gsl.dat. These measurements have been investigated in an effort to understand the dynamics of the precipitous rise of the lake during the years 1983–1987 and its subsequent rapid retreat; see, e.g., Lall et al. (1996) and Moon et al. (2008) for background information on recent analyses. Such behavior is typical of nonlinear systems driven by large-scale, persistent, climatic fluctuations.
(b) Assume the GSL time series is generated by a NLAR(2) process. Based on the
first T = 507 observations (training set) of the standardized GSL data, obtain
twelve one-step ahead forecasts. Re-estimate the model before each forecast is
computed (expanding the training set) and use the following estimation methods.
• k-NN regression with the computer code from part (a). Given a fixed sample size n, comment on the choice of $k_n$ in the limiting cases $k_n = n$ and $k_n = 1$.
• locally constant kernel regression with a Gaussian product kernel and a
single bandwidth obtained by CV.
[Hint: Use the functions npregbw and npksum in the R-np package.]
• AVAS estimation with bandwidth obtained by CV and no weights. Com-
ment on the selected transformation of the GSL time series.
[Hint: Use the AVAS function in the R-acepack package.]
Comment on which method produces forecasts with smallest MSFEs over the
course of the year.
9.3 The data set ExpAR2.dat contains 200 simulated data points from an ExpAR(2) model of the form

$$Y_t = \{0.9 + 0.1\exp(-Y_{t-1}^2)\}Y_{t-1} - \{0.2 + 0.1\exp(-Y_{t-1}^2)\}Y_{t-2} + \varepsilon_t, \qquad \{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0,1).$$
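A short R sketch that simulates the ExpAR(2) model above, as a stand-in for ExpAR2.dat:

## Simulate 200 observations from the ExpAR(2) model, with burn-in.
set.seed(42)
T <- 200; y <- numeric(T + 50)
for (t in 3:(T + 50)) {
  g <- exp(-y[t - 1]^2)
  y[t] <- (0.9 + 0.1 * g) * y[t - 1] - (0.2 + 0.1 * g) * y[t - 2] + rnorm(1)
}
y <- tail(y, T)
plot.ts(y)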
9.4 Consider the Old Faithful Geyser data introduced in Example 9.2. Here, we explore some aspects of the data that were not investigated previously. In particular, we focus on forecasting the last ten ($H_{max} = 10$) observations of the waiting time $\{Y_t\}_{t=1}^{299}$,
where t denotes the eruption number (geyser waiting.dat). If the time to next eruption
can be predicted accurately, visitors to the Yellowstone National Park could use this
information to organize their visit.
(a) Recall the empirical method for selecting the Markov coefficient p in (9.10). Set $p_{max} = 10$, k = 60, and take $h = \widehat{\sigma}_Y T^{-1/(p+4)}$ (p = 1, . . . , $p_{max}$). Verify that for the conditional mean the most appropriate order of the NLAR process equals p = 1, using the function $\widehat{f}_2(p)$ with $\{Y_t\}_{t=230}^{289}$.
(b) Using the specification in part (a), compute the conditional mean, median, and
mode for h = 1, . . . , Hmax given the observations up to and including the waiting
time at t = 289 (Y289 = 47). Summarize the forecast performance in terms of
the MSFE and RMAFE and comment on the results.
(c) Suggest an empirical method to construct forecast intervals on the basis of the
nonparametric estimates.
(d) Until now we have not used information on the eruption duration time. Based on descriptive statistics and boxplots of the waiting and duration times, the following simple (naive) forecasting rule has been suggested.⁴ An eruption with
a duration < 3 minutes will be followed by a waiting time of about 55 minutes,
while an eruption with a duration > 3 minutes will be followed by a waiting
time of about 80 minutes. For the last ten observations, compare and contrast
the forecasting performance of this rule with the results obtained in part (b).
9.5 Consider the river flow data set, consisting of the hourly river flow time series $\{Y_t\}_{t=1}^{401}$ introduced in Example 9.3 (file name: flow.dat) and the hourly rainfall time series $\{X_t\}_{t=1}^{401}$ (file name: rain.dat). Following the forecasting procedure described in Example 9.8, obtain forecasts $\widehat{Y}_{t+H|t}$ from past values of $\{(Y_t, X_t)\}$ for H = 1, 10, and 20, with the initial information set defined from t = 1 until t = 366.
(a) Use the following methods to produce the 15 out-of-sample forecasts: glmboost,
gamboost, MARS, and VAR (unrestricted). Summarize the forecasts in terms
of MSFEs and discuss the results.
[Hint: Modify the Forecasting-USunemplmnt.r function (file: example 9-8.zip), available at the website of this book. Note that the computations can be time-consuming.]
(b) Using the four forecasting methods mentioned above, obtain the MSFEs of {Yt }
in a univariate setting. Compare your results with those obtained in part (a).
⁴ See Chatterjee, Handcock, and Simonoff (1995, pp. 224–226), A Casebook for a First Course in Statistics and Data Analysis, Wiley.
Chapter 10
FORECASTING
Forecasting with nonlinear models requires knowledge of the conditional pdf of $\{Y_t, t \in \mathbb{Z}\}$, which is a substantial task in general. The task becomes easier for a NLAR(p) model
Yt = μ(Xt−1 ; θ) + εt , (10.1)
So, for H = 1, the conditional mean is independent of the distribution of εt+1 which
is an important property for both linear and NLAR models. When H ≥ 2, however,
this is true only for linear models.
For example, the two-step ahead LS forecast for model (10.1) is given by

$$Y^{LS}_{t+2|t} = E(Y_{t+2}|X_t) = E\{\mu(X_{t+1};\theta) + \varepsilon_{t+2}|X_t\} = E\big\{\mu\big(\mu(X_t;\theta) + \varepsilon_{t+1}\big)\big|X_t\big\} = \int_{-\infty}^{\infty}\mu\big(\mu(X_t;\theta) + \varepsilon\big)\,dF(\varepsilon), \qquad (10.3)$$
where F (·) is the distribution function of {εt }. Thus, the second term on the right-
hand side of (10.3) depends on F (·), and cannot further be reduced as in (10.2).
The reason is that, in general, the conditional expectation of a nonlinear function is
not equal to the function evaluated at the expected value of its argument.
From the above results one may erroneously conclude that it is not possible to
obtain closed-form analytical expressions for H ≥ 2 forecasts. However, by using
the so-called Chapman–Kolmogorov recurrence relationship, “exact” LS multi-step ahead forecasts for general NLAR models can, in principle, be obtained through complex numerical integration, as we will see in Section 10.1.¹ That section also describes two “exact” forecast strategies for SETARMA models.
An alternative way to obtain more than one-step ahead forecasts, and possibly
the nearest one can get to an explicit analytical form, is a numerical approximation
(Monte Carlo simulation, bootstrap and related methods), a series expansion, or by
assuming that the innovation distribution is known. Applying these and some other
approaches, we discuss seven approximate methods for making point forecasts in
Section 10.2.
With point forecasts, the accuracy is often measured by the forecast error vari-
ance or by a forecast interval. In Section 10.3, we address the problem of construct-
ing (bootstrap) forecast intervals and regions for nonlinear and nonparametric ARs.
We make a distinction between percentile- and density-based forecast intervals. The
latter intervals are often more informative than the former when, for instance, the
forecast distribution is asymmetric or multimodal. In Section 10.4, we provide a
limited review of measures evaluating the accuracy of competing point forecasts. In
the same vein, this section gives a description of methods for interval and density
evaluation. Finally, in Section 10.5, we briefly discuss methods for optimal forecast
combination. By combining forecasts of different models/methods instead of relying
on individual forecasts, forecast accuracy can often be improved.
¹ As noted above, the solution of the Chapman–Kolmogorov recurrence relationship requires numerical integration techniques. The quotation marks around “exact” emphasize that the numerical accuracy of H ≥ 2 forecasts depends on certain tuning parameters: for instance, a change of variable of integration to obtain a finite range, and the judicious choice of weights and abscissae of a numerical integration method.
Alternatively, this equation can be obtained by considering the joint pdf of $Y_{t+H}, Y_{t+H-1}, \ldots, Y_{t+1}$ conditional on $X_t = x$ and integrating out the unwanted variables.² Introducing the short-hand notation $f_H(\cdot) = f_{Y_{t+H}|Y_t}(\cdot|x)$, equation (10.4) immediately gives (10.5).
Except for some special cases of μ(·; θ), the integral equations (10.5) and (10.6)
do not readily admit explicit analytic solutions. To evaluate (10.6) numerically,
each forecasting step requires p + 1 numerical integrations. Standard numerical
integration methods can be used for this purpose, but care must be taken to handle
accumulation of rounding errors; see, e.g., Pemberton (1987), Al-Qassem and Lane
(1989), and Cai (2005).
where $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0,1)$. In the sequel, ϕ(·) denotes the pdf and Φ(·) the CDF of the N(0, 1) distribution.

² For economy of notation, we suppress the dimension of the information set on which the conditional density forecast is conditioned.
Figure 10.1: (a) Forecast density f (yt+H |xt ) (H = 1, . . . , 5) for the SETAR(2; 0, 0) model
(10.8); (b) Conditional mean E(Yt+H |Xt ) (H = 2, . . . , 5; α = 1).
$$f(y_{t+H}|x) = w_1^{(H)}(\beta)\,\varphi\big(y_{t+H} - I(Y_t \le 0)\alpha\big) + w_2^{(H)}(\beta)\,\varphi\big(y_{t+H} + I(Y_t > 0)\alpha\big), \qquad (10.10)$$

where $w_1^{(H)}(\beta) = (1 - \beta^{H-1})/2$, $w_2^{(H)}(\beta) = 1 - w_1^{(H)}(\beta)$, and $\beta = 1 - 2\Phi(\alpha)$; cf. Exercise 10.3. From (10.10), the conditional mean and the conditional variance are given by, respectively,

$$E(Y_{t+H}|X_t = x) = \alpha\beta^{H-1}I(Y_t \le 0) - \alpha\beta^{H-1}I(Y_t > 0),$$
$$\mathrm{Var}(Y_{t+H}|X_t = x) = 1 + \alpha^2 - E^2(Y_{t+H}|X_t = x).$$

Note that the skewness of $f(y_{t+H}|x)$ is affected by both H and β, which determine the weights $w_i^{(H)}(\beta)$ (i = 1, 2) of the linear combination of $\varphi(y_{t+H} + \alpha)$ and $\varphi(y_{t+H} - \alpha)$; see Figure 10.1(a). Figure 10.1(b) shows plots of the H-step ahead conditional mean.
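The closed-form results of Example 10.1 are cheap to evaluate. A minimal R sketch; the assignment of the mixture weights to the ±α components is derived here by matching the conditional mean formula above, so it should be read as an illustration rather than a transcription of (10.10):

## H-step forecast density and moments for the SETAR(2;0,0) example.
setar00 <- function(alpha, H, ygrid, yt = -1) {
  beta <- 1 - 2 * pnorm(alpha)
  w1 <- (1 - beta^(H - 1)) / 2           # weight of the 'opposite' component
  s <- ifelse(yt <= 0, 1, -1)            # sign determined by the start regime
  dens <- (1 - w1) * dnorm(ygrid - s * alpha) + w1 * dnorm(ygrid + s * alpha)
  m <- s * alpha * beta^(H - 1)
  list(dens = dens, mean = m, var = 1 + alpha^2 - m^2)
}
ygrid <- seq(-4, 4, by = 0.05)
matplot(ygrid, sapply(1:5, function(H) setar00(1, H, ygrid)$dens),
        type = "l", lty = 1, xlab = "y", ylab = "forecast density")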
From (6.8), we observe that the two-regime SETARMA model can be written as

$$Y_t = \big\{\phi_0^{(1)} + \phi_{p_1}^{(1)}(B)Y_t + \psi_{q_1}^{(1)}(B)\varepsilon_t\big\}\,I(Y_{t-d} \le r) + \big\{\phi_0^{(2)} + \phi_{p_2}^{(2)}(B)Y_t + \psi_{q_2}^{(2)}(B)\varepsilon_t\big\}\,I(Y_{t-d} > r).$$
$$Y^{LS}_{t+H|t} = Y^{(1)}_{t+H|t}\,E(I_{t+H-d}|\mathcal{F}_t) + Y^{(2)}_{t+H|t}\,\big\{1 - E(I_{t+H-d}|\mathcal{F}_t)\big\}, \qquad (10.13)$$

where $Y^{(i)}_{t+H|t}$ is the ARMA forecast in regime i, and $\mathcal{F}_t = \{Y_t, Y_{t-1}, \ldots\}$ denotes the information set up to time t. Depending on the case H ≤ d or H > d, there are various approaches to calculate the forecast and the forecast error variance.
Case H ≤ d: The forecast error variance is given by

$$\mathrm{Var}(e^{LS}_{t+H|t}) = \sigma_\varepsilon^2\sum_{j=1}^{H-1}\big\{(\omega_j^{(1)})^2 I_{t+H-d} + (\omega_j^{(2)})^2(1 - I_{t+H-d})\big\}, \qquad (10.14)$$
Case H > d: Observe that $Y_{t+H-d} \notin \mathcal{F}_t$, so the value of the threshold variable is unknown. This makes the computation of the LS forecast more complicated. For this case Amendola et al. (2006b) suggest the following forecast strategies.
• Least squares (LS) forecast: Clearly, under the stationarity assumption, $I_{t+H-d}$ becomes a Bernoulli random variable $i_{H-d}$ according to

$$i_{H-d} = \begin{cases} 1 & \text{with } P(Y_{t+H-d} \le r|\mathcal{F}_t) \equiv p^{(H-d)}, \\ 0 & \text{with } P(Y_{t+H-d} > r|\mathcal{F}_t) \equiv 1 - p^{(H-d)}. \end{cases} \qquad (10.15)$$
Thus, the indeterminacy regarding the future now hinges on $p^{(H-d)}$. In this case, the LS forecast in (10.13) reduces to

$$Y^{LS}_{t+H|t} = Y^{(2)}_{t+H|t} + p^{(H-d)}\big(Y^{(1)}_{t+H|t} - Y^{(2)}_{t+H|t}\big), \qquad (10.16)$$
$$\mathrm{Var}(e^{LS}_{t+H|t}) = \cdots + \big(p + (p^{(H-d)})^2 - 2p\,p^{(H-d)}\big)\Big[\mathrm{Var}(Y^{(1)}_{t+H|t}) + \mathrm{Var}(Y^{(2)}_{t+H|t}) - 2\sigma_\varepsilon^2\sum_{j=H}^{\infty}\omega_j^{(1)}\omega_j^{(2)}\Big], \qquad (10.17)$$

where $e^{(i)}_{t+H|t}$ is the forecast error in regime i, p the unconditional expected value of $I_{t+H-d}$, $\mathrm{Var}(Y^{(i)}_{t+H|t}) = \sigma_\varepsilon^2\sum_{j=H}^{\infty}(\omega_j^{(i)})^2$ the forecast variance in regime i (i = 1, 2), and the last term in square brackets in (10.17) denotes the covariance between the forecasts generated from the two regimes.
We note that the LS and PI forecast strategies make use of the available information set differently. Nevertheless, it is easy to prove that both $Y^{LS}_{t+H|t}$ and $Y^{PI}_{t+H|t}$ are unbiased estimators of $Y_{t+H}$. However, in terms of minimum MSFE, the gain in using one method over the other comes from their forecast error variances. Since $p^{(H-d)} \to p$ as $T \to \infty$, it can be deduced that $\mathrm{Var}(e^{LS}_{t+H|t}) \ge \mathrm{Var}(e^{PI}_{t+H|t})$ under certain conditions.
with $i_{t+H-d}$ the indicator function given by (10.18). Accordingly, the combined forecast is as good as the best of the two forecast methods, LS and PI. Note that in practice a reasonable approximation of $p^{(H-d)}$ (H > d) is needed for all three forecast strategies, and hence the quotation marks around “exact”.
Table 10.1: Averaged MSFEs and MAFEs for the least squares (LS), plug-in (PI), and combined (C) forecast strategies for the SETARMA(2; 1, 1, 1, 1) models specified in Example 10.2; T = 250, and 1,000 MC replications.

(0.6, 0.6, −0.7, −1, 0.4, 0.5, 0)′. So the difference between these models is that the second model has intercept terms while the first one has not. It is well known that non-zero intercepts can greatly accentuate or attenuate the relative forecast performance of the SETARMA model. The number of MC replications is set to 1,000 with $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0,1)$, and T = 250. The forecast
Figure 10.2: Boxplots of the forecast errors of the LS and PI forecast strategies; T = 250,
1,000 MC replications.
$$Y^{MC}_{t+2|t} = \frac{1}{N}\sum_{i=1}^{N} Y^{MC_i}_{t+2|t}, \qquad (10.20)$$

where $Y^{MC_i}_{t+2|t} = \mu\big(\mu(Y_t;\widehat{\theta}\,) + \varepsilon_{2,i};\widehat{\theta}\,\big)$, with $\{\varepsilon_{2,i}\}_{i=1}^{N}$ a set of pseudo-random numbers drawn from the presumed distribution of $\{\varepsilon_{t+1}\}$, and with N some large number. In general, the H-step ahead forecast is given by

$$Y^{MC}_{t+H|t} = \frac{1}{N}\sum_{i=1}^{N} Y^{MC_i}_{t+H|t}, \qquad (10.21)$$

where $Y^{MC_i}_{t+H|t}$ is obtained by iterating the model H steps ahead, drawing at each step a fresh pseudo-random number from the presumed error distribution.
10.2.2 Bootstrap
Forecasts obtained from the bootstrap (BS) method are similar to the MC simulation
method except that the e∗j,i are drawn randomly (with replacement) from the within-
sample residuals ei (i = 2, 3, . . . , T ), assuming a set of T historical data is available
to obtain some consistent estimate of θ. In this case the H-step ahead (H ≥ 2)
forecast is given by

$$Y^{BS}_{t+H|t} = \frac{1}{T-1}\sum_{i=2}^{T} Y^{BS_i}_{t+H|t}, \qquad (10.22)$$

where

$$Y^{BS_i}_{t+H|t} = \mu\big(Y^{BS_i}_{t+H-1|t};\widehat{\theta}\,\big) + e^*_{H,i}.$$
The advantage of this method over the MC method is that no assumptions are made
about the distribution of the innovation process.
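Both approximations are easy to program. A minimal R sketch for a fitted NLAR(1) with skeleton mu(); for brevity it randomizes over residual paths rather than enumerating them, so it is a stochastic variant of (10.21) and (10.22):

## MC versus BS multi-step point forecasts for an NLAR(1) model.
multi.step <- function(mu, yT, H, resid, N = 10000, method = c("MC", "BS")) {
  method <- match.arg(method)
  paths <- replicate(N, {
    y <- yT
    for (h in 1:H) {
      e <- if (method == "MC") rnorm(1) else sample(resid, 1, replace = TRUE)
      y <- mu(y) + e                    # iterate the fitted model one step
    }
    y
  })
  mean(paths)                           # average over the simulated paths
}
mu <- function(y) 0.8 * y * exp(-0.5 * y^2)   # example skeleton (assumed)
set.seed(7); res <- rnorm(200)                # stand-in residuals
c(MC = multi.step(mu, 1, 3, res, method = "MC"),
  BS = multi.step(mu, 1, 3, res, method = "BS"))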
which we ‘switch off the white noise’ in (10.1). Thus, the two-step ahead forecast is given by

$$Y^{SK}_{t+2|t} = \mu(Y_{t+1|t};\widehat{\theta}\,).$$

Note that this approach leads to biased predictions since $Y^{SK}_{t+2|t} \ne E(Y_{t+2}|X_t)$. By induction, the H-step ahead forecast can be computed as

$$Y^{SK}_{t+H|t} = \mu\big(\mu(\cdots\mu(Y^{SK}_{t+1|t};\widehat{\theta}\,)\cdots);\widehat{\theta}\,\big). \qquad (10.23)$$
$$Y^{ELS}_{t+2|t} = \frac{1}{T-1}\sum_{i=2}^{T}\mu\big(\mu(Y_t;\widehat{\theta}\,) + e_i\big). \qquad (10.24)$$

The ELS method can be readily extended to the case H > 2. For instance, the exact three-step ahead LS forecast is given by

$$Y^{LS}_{t+3|t} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\mu\Big(\mu\big(\mu(Y_t;\theta) + \varepsilon\big) + \varepsilon'\Big)\,dF(\varepsilon)\,dF(\varepsilon'),$$

with empirical counterpart

$$Y^{ELS}_{t+3|t} = \frac{1}{(T-1)(T-2)}\sum_{2 \le i \ne j \le T}\mu\Big(\mu\big(\mu(Y_t;\widehat{\theta}\,) + e_i\big) + e_j\Big).$$
More generally,

$$Y^{LS}_{t+H|t} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\mu\Big(\mu\big(\cdots(\mu(Y_t;\theta) + \varepsilon_1) + \cdots\big) + \varepsilon_{H-1}\Big)\,dF(\varepsilon_1)\cdots dF(\varepsilon_{H-1}),$$

with ELS analogue

$$Y^{ELS}_{t+H|t} = \frac{(T-H)!}{(T-1)!}\sum_{(H-1,T)}\mu\Big(\mu\big(\cdots(\mu(Y_t;\widehat{\theta}\,) + e_{1,i}) + \cdots\big) + e_{H-1,i}\Big), \qquad (10.25)$$

where the summation $\sum_{(H-1,T)}$ runs over all possible (H − 1)-tuples of distinct indices $(i_1, \ldots, i_{H-1})$. Guo et al. (1999) show that the above prediction scheme is asymptotically equivalent to the exact LS forecast.
The ELS method can be easily generalized to NLAR models with conditional
heteroskedasticity. For instance, consider the model
Yt = μ(Yt−1 ; θ1 ) + εt σ(Yt−1 ; θ2 ),
where $\theta_i$ (i = 1, 2) is a vector of unknown parameters, $\mu(\cdot;\theta_1)$ and $\sigma(\cdot;\theta_2)$ are two real-valued known functions on $\mathbb{R}$, and the $\varepsilon_t$'s are assumed to satisfy $E(\varepsilon_t^2) = 1$ for identification purposes. Given T observations, the series $\{e_i\}$ can be calculated
exactly from the model based on particular estimates of θi . Next, we use these
residuals as proxies for the disturbance term instead of random draws from some
assumed parametric distribution as in Section 10.2.1. Then, using the same idea
as above, the H-step ahead predictor follows directly. It is apparent that, in com-
parison with the MC predictor, the ELS predictor is less sensitive to distributional
assumptions about the error process.
ExpAR(1) model
To obtain the NFE forecast value for any step, we employ the following result. Let r(Z) be a function of the random variable $Z \sim N(0, \sigma_Z^2)$, and let M and c be constants. Then (10.26) follows, where $A = 1 + 2c\sigma_Z^2$, $c_1 = cA^{-1}$, and $V \sim N(-2c_1\sigma_Z^2 M,\, \sigma_Z^2/A)$; cf. Exercise 10.6.
where $A_{H-1} = 1 + 2\sigma^2_{e,H-1}$, $c_{H-1} = A^{-1}_{H-1}$, and $\xi_{H-1} = \xi A^{-3/2}_{H-1}$. After substitution and some algebra, the forecast error follows, and $E(e_{t+H|t}) = 0$. Since $e_{t+H-1|t}$ does not depend on the future noise $\varepsilon_{t+H}$, the forecast error variance is given by

$$\sigma^2_{e,H} = \phi^2\sigma^2_{e,H-1} + \xi^2 v_{H-1} + 2\phi\xi u_{H-1} + \sigma_\varepsilon^2, \qquad (10.28)$$

where $\sigma^2_{e,1} = \sigma_\varepsilon^2$; using (10.26) with c = 2, the terms $v_{H-1}$ and $u_{H-1}$ follow, with $B_{H-1} = 1 + 4\sigma^2_{e,H-1}$ and $d_{H-1} = 2B^{-1}_{H-1}$. Moreover, it can be deduced that
SETAR(2; 1, 1) model
Consider, as a special case of (10.11), the SETAR(2; 1, 1) model
where $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\varepsilon^2)$. Assume that the (H − 1)-step (H ≥ 2) ahead forecast errors
For H = 2, it can be shown that (10.30) is identical to the two-step ahead ex-
act MMSE forecast; cf. Exercise 10.1. The above results can be directly extended
to more general SETAR models, including models with multiple regimes, and to
situations where the delay has a value greater than one. An additional advantage is
that for both ExpAR(1) and SETAR(2; 1, 1) models the NFE method can be rapidly
calculated using, for instance, a spreadsheet.
for stationarity are $\phi^{(1)} < 1$, $\phi^{(2)} < 1$, and $\phi^{(1)}\phi^{(2)} < 1$; see Table 3.1. Subject to these conditions, we compute $Y^{NFE}_{t+H|t}$ for H = 3, . . . , 10 with parameter values $\phi^{(1)} = -1.50, -1.25, \ldots, 0.50, 0.75$ and $\phi^{(2)} = -1.75, -1.50, \ldots, 0.50, 0.75$.
Also, we obtain H-step ahead forecasts by the MC method, generating for each
step H 100,000 realizations of Yt+H . Next, for each parameter combination,
we calculate the relative mean absolute forecast error (RMAFE):

$$\mathrm{RMAFE}_t = \frac{1}{8}\sum_{H=3}^{10}\big|\big(Y^{NFE}_{t+H|t} - Y^{MC}_{t+H|t}\big)\big/Y^{MC}_{t+H|t}\big|. \qquad (10.32)$$
Figure 10.3 shows a contour plot of (10.32). The results indicate good agree-
ment between the NFE and the MC method over a wide range of parameter
values. More generally, MC simulations show that for values of σε2 = 0.4 and 1
Figure 10.3: Contour plot of (10.32) for the SETAR(2; 1, 1) model (10.29) with r = 0, $Y_0 = 0$, $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0,1)$. From De Gooijer and De Bruin (1998).
the SETAR–NFE method performs well compared with the exact and the MC forecasting methods. For $\sigma_\varepsilon^2 = 2$, NFE is quite reliable for forecasts up to, say, five or six steps ahead.
10.2.6 Linearization
Another approach to approximate the exact forecast $Y_{t+H|t}$ is to linearize the problem. In particular, Taylor's expansion up to order two of μ(·;θ) about the point $Y_{t+H-1|t}$ (ignoring the remainder term) gives (10.33), where $\mu^{(i)}(\cdot;\theta)$ (i = 1, 2) denotes the ith derivative of $\mu(Y_{t+H-1|t};\theta)$ with respect to $Y_{t+H-1|t}$, and $e_{t+H-1|t}$ is the (H − 1)-step ahead forecast error (H ≥ 2). We refer to this approach as the linearization (LN) method.
Assume, for simplicity, that the forecast error process $\{e_{t+H-1|t}\}$ is i.i.d. $N(0, \sigma^2_{e,H-1})$ distributed. Then, substituting (10.33) in the NLAR(1) model and taking the conditional expectation of the resulting specification gives the H-step ahead LN forecast, i.e.

$$Y^{LN}_{t+H|t} \approx \mu(Y_{t+H-1|t};\theta) + \frac{1}{2}\sigma^2_{e,H-1}\,\mu^{(2)}(Y_{t+H-1|t};\theta). \qquad (10.34)$$
Substituting (10.34) in the corresponding H-step ahead forecast error and simplifying gives

$$e_{t+H|t} = \varepsilon_{t+H} + e_{t+H-1|t}\,\mu^{(1)}(Y_{t+H-1|t};\theta) + \frac{1}{2}\big\{e^2_{t+H-1|t} - \sigma^2_{e,H-1}\big\}\mu^{(2)}(Y_{t+H-1|t};\theta).$$

The forecast error variance for this step is given by the recurrence relation

$$\sigma^2_{e,H} = \sigma_\varepsilon^2 + \sigma^2_{e,H-1}\big[\mu^{(1)}(Y_{t+H-1|t};\theta)\big]^2 + \frac{1}{2}\sigma^4_{e,H-1}\big[\mu^{(2)}(Y_{t+H-1|t};\theta)\big]^2. \qquad (10.35)$$
Forecasts obtained from this method can be quite different from the exact prediction
method or from the NFE method for moderate or large σε2 (mainly ≥ 10−2 ). Al-
Qassem and Lane (1989) provide a discussion on the limiting behavior of (10.33)
in the case of the ExpAR(1) model. They emphasize the need for great caution in
using linearized forecasts in nonlinear models.
Extension of the LN method to ExpAR(p) models is straightforward, with a Taylor expansion of μ(·;θ) around the point $\mathbf{Y}_{t+H-1|t} = (Y_{t+H-1|t}, Y_{t+H-2|t}, \ldots, Y_{t+H-p|t})'$, where $Y_{t+j|t} = Y_{t+j}$ if j ≤ 0. Similarly, an expression for the H-step ahead forecast error variance can be obtained by assuming that the forecast errors have a multivariate normal distribution.
where θ = (φ, ξ, γ)′. Using the partial derivatives of μ(·;θ) with respect to its argument, the LN forecast becomes

$$Y^{LN}_{t+H|t} = \big\{\phi + \xi f_{H-1}\exp\big(-\gamma(Y_{t+H-1|t})^2\big)\big\}\,Y_{t+H-1|t},$$

where

$$f_{H-1} = 1 + \gamma\sigma^2_{e,H-1}\big\{2\gamma(Y_{t+H-1|t})^2 - 3\big\}.$$

Thus, $f_{H-1}$ is increasing with $\sigma^2_{e,H-1}$. We also see that if $\sigma^2_{e,H-1}$ is large and $Y_{t+H-1|t}$ is near zero, $f_{H-1}$ can be negative. It seems that this is the root cause of the instability of the LN method.
Figure 10.4(a) shows 50 forecasts obtained by the NFE, SK, and LN methods
applied to a typical single simulation of an ExpAR(1) model with φ = 0.8,
Figure 10.4: Forecasts from the ExpAR(1) model in Example 10.4 with the NFE, SK,
and LN methods; (a) σε2 = 1, and (b) σε2 = 0.01.
ξ = 0.3, $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} N(0,1)$, and starting value $Y_0 = 1$. By relation (6.6) the process has two limit points at ±0.6368. It is clear that the NFE forecasts go to a limit point zero, the SK forecasts go to the upper limit point 0.6368, while the series of LN forecasts is unstable up to about H = 30, and then stabilizes at a point far off the upper limit point. Four more plots are given in Figure 10.4(b) for $\sigma_\varepsilon^2 = 0.01$.
For short-term forecasting (H ≤ 5) there is hardly any noticeable difference
between the three forecasting methods, provided σε2 is small. On the other
hand, for long-term (H ≥ 30) forecasting the LN method may go to the
“wrong” limit point.
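The LN recursion is straightforward to implement. A minimal R sketch for the ExpAR(1) model of this example, with the derivatives $\mu^{(1)}$ and $\mu^{(2)}$ computed analytically and the SK iterates of (10.23) shown for comparison:

## LN forecasts (10.34) with variance recursion (10.35) for an ExpAR(1).
phi <- 0.8; xi <- 0.3; gam <- 1; s2e <- 1
mu  <- function(y) (phi + xi * exp(-gam * y^2)) * y
mu1 <- function(y) phi + xi * exp(-gam * y^2) * (1 - 2 * gam * y^2)
mu2 <- function(y) 2 * gam * xi * y * exp(-gam * y^2) * (2 * gam * y^2 - 3)
H <- 40; y.ln <- y.sk <- numeric(H); s2 <- s2e
y.ln[1] <- y.sk[1] <- mu(1)                   # one-step forecast from Y_t = 1
for (h in 2:H) {
  y.sk[h] <- mu(y.sk[h - 1])                  # SK: iterate the skeleton
  y.ln[h] <- mu(y.ln[h - 1]) + 0.5 * s2 * mu2(y.ln[h - 1])               # (10.34)
  s2 <- s2e + s2 * mu1(y.ln[h - 1])^2 + 0.5 * s2^2 * mu2(y.ln[h - 1])^2  # (10.35)
}
cbind(SK = y.sk, LN = y.ln)[c(1:5, 30, 40), ]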
where $\theta_H^*$ is a vector of parameters depending upon the forecast horizon H. Given estimates $\widehat{\theta}_H^*$, the corresponding H-step ahead DE forecast can be written as

$$Y^{DE}_{t+H|t} = \mu(Y_t;\widehat{\theta}_H^*). \qquad (10.37)$$
In a linear setting, there are no gains in terms of increased forecast accuracy using
DE over the traditional minimization of in-sample sum of squares of one-step ahead
errors when the model is correctly specified. When a nonlinear model, however, is
correctly specified, the DE method may result in better out-of-sample forecasts due
to its simplicity. An obvious drawback of the method is that the nonlinear model
needs to be estimated for each forecasting horizon.
$$Y^{DE}_{t+H|t} = \mu(Y_t;\widehat{\theta}_H^*) = \widehat{\theta}_H^{(1)} Y_t\,I(Y_t \le \widehat{r}_H) + \widehat{\theta}_H^{(2)} Y_t\,\big\{1 - I(Y_t \le \widehat{r}_H)\big\}. \qquad (10.41)$$
Note that $\{\varepsilon^*_{t+H}\}$ is not a WN process, but exhibits temporal dependence. So, in general, the forecasts are biased.
In an MC simulation experiment Clements and Smith (1997) conclude that
the DE method is worse than the BS, MC and NFE forecasting methods for
SETAR(2; 1, 1) models with Gaussian disturbances and zero intercepts.
10.3.1 Preliminaries
The forecast methods discussed in the previous two sections produce a single ap-
proximation for $Y_{T+H}$. Ideally, forecast intervals/regions are more informative than point predictions as they indicate the likely range of forecast outcomes. As such, a forecast interval/region is a measure of the inherent model accuracy. The conditional distribution of $Y_{T+H}$ given $\mathcal{F}_T = \{Y_T, Y_{T-1}, \ldots\}$ forms the natural basis for constructing a forecast interval/region for $Y_{T+H}$. Given $X_t = x$, $Q_\alpha \equiv Q_\alpha(x) \subset \mathbb{R}$ is such an interval with coverage probability $1 - \alpha$ ($\alpha \in [0,1]$). That is, $P\{Y_{T+H} \in Q_\alpha(x)|X_{T-H-p+1} = x\} = 1 - \alpha$, assuming
the DGP is strictly stationary and Markovian of order p. The set Qα will be called
forecast region (FR). When Qα is a connected set, we call it a forecast interval
(FI). Obviously, such a region/interval can be constructed in an infinite number of
ways. For instance, a natural FI for the conditional median of YT +H is the so-called
conditional percentile interval (CPI) given by
where $\xi_\alpha(\cdot)$ is the αth conditional percentile defined by (9.11) with α ≡ q, i.e., changing the notation of the quantile level q to the symbol α.
In the context of linear ARMA models, we normally construct a FI for H ≥
1 steps ahead by using an estimate of the conditional mean, an estimate of the
conditional variance, and, in addition, a certain critical value taken from either the
normal or the Student t distribution. For some nonparametric methods, FIs can be
constructed on the basis of available asymptotic theory of the forecast under study
(Yao and Tong, 1995). In general, however, some form of resampling is necessary
because of non-normality of the forecast errors and/or nonlinearity of the forecast.
Below, we consider both approaches, making a distinction between FI/FRs based
on percentiles and on conditional densities where in the latter case the shape of the
densities may change over the domain of Xt .
unknown functions on R. Let f (x) denote the density function of the lag vector
at the point Xt = x. Recall μ NW (x), the NW estimator of the conditional mean
function μ(x), is given by (9.36). Under certain mixing conditions it can be shown
(see, e.g., Fan and Gijbels, 1996, Thm. 6.1) that μ NW (x) is asymptotically normally
distributed with asymptotic bias and variance given by
$$\mathrm{Bias}\big(\widehat{\mu}^{NW}(x)\big) = \frac{1}{2}\mu_2(K)h^2\Big\{\mu^{(2)}(x) + 2\mu^{(1)}(x)\frac{f^{(1)}(x)}{f(x)}\Big\}, \qquad (10.44)$$

$$\mathrm{Var}\big(\widehat{\mu}^{NW}(x)\big) = \frac{1}{nh}R(K)\frac{\sigma^2(x)}{f(x)}, \qquad (n = T - H - p + 1), \qquad (10.45)$$

where $\mu_2(K) = \int_{\mathbb{R}} u^2K(u)\,du$ and $R(K) = \int_{\mathbb{R}} K^2(z)\,dz$. Similarly, based on the LL regression approach, the estimator $\widehat{\mu}^{LL}(x)$ of μ(x) is asymptotically normally distributed with asymptotic bias and variance

$$\mathrm{Bias}\big(\widehat{\mu}^{LL}(x)\big) = \frac{1}{2}\mu_2(K)h^2\mu^{(2)}(x), \qquad \mathrm{Var}\big(\widehat{\mu}^{LL}(x)\big) = \frac{1}{nh}R(K)\frac{\sigma^2(x)}{f(x)}. \qquad (10.46)$$
We see that the bias of the NW estimator does not only depend on the first- and
second derivatives of μ(x), but also on the score function −f (1) (x)/f (x). This is the
reason why an unbalanced design may lead to an increased bias, especially when p is
large and T is small. Clearly, consistent bias estimates of the NW and LL estimators
of μ(x) require estimates of $\mu^{(2)}(x)$. Such estimates will possibly reduce the bias, and hence improve forecast accuracy in small samples. On the other hand, the variance may increase since more parameters have to be estimated. Thus, it is reasonable to
construct FIs for both nonparametric conditional mean estimators without a small-
sample bias correction. Since the expression for the asymptotic variance is the same
NW (x) and μ
for μ LL (x), the resulting FI with coverage probability (1 − α) is defined
as
$$\mathrm{FI}_\alpha = \Bigg[\widehat{\mu}^{(\cdot)}(x) - z_{\alpha/2}\sqrt{\widehat{\sigma}^2(x) + \frac{\widehat{\mathrm{Var}}\big(\widehat{\mu}^{(\cdot)}(x)\big)}{nh}},\;\; \widehat{\mu}^{(\cdot)}(x) + z_{\alpha/2}\sqrt{\widehat{\sigma}^2(x) + \frac{\widehat{\mathrm{Var}}\big(\widehat{\mu}^{(\cdot)}(x)\big)}{nh}}\,\Bigg]. \qquad (10.47)$$

Here, $z_{\alpha/2}$ denotes the (1 − α/2)th percentile of the standard normal distribution, and the notation $\widehat{\mu}^{(\cdot)}(x)$ denotes the NW or the LL conditional mean forecast.
$$Y_t = \Big\{\phi_0^{(1)} + \sum_{i=1}^{p}\phi_i^{(1)}Y_{t-i}\Big\}I(Y_{t-d} \le r) + \Big\{\phi_0^{(2)} + \sum_{i=1}^{p}\phi_i^{(2)}Y_{t-i}\Big\}I(Y_{t-d} > r) + \varepsilon_t, \qquad (10.48)$$

where $\{\varepsilon_t\} \overset{\text{i.i.d.}}{\sim} (0,1)$, and p is assumed to be known. Given the
initial, pre-sample, values (Y−p+1 , . . . , Y0 ) and the set of observations {Yt }Tt=1 , an
estimate $\widehat{r}_T$ of r follows from using Algorithm 6.2. We have seen in Section 6.1.2 that this estimator is super-consistent, with rate of convergence $O_p(T^{-1})$.
Bootstrap FIs for linear ARs have received quite some attention; see, e.g., Pan
and Politis (2016) for a recent review. Within this context, BS can be based on the
backward and forward time representation of an AR(p) model. For SETAR models
there is no immediate way of inverting the lag polynomial augmented with indicator
variables. Hence, the so-called backward BS procedure does not apply in this case.
In contrast, the forward BS generates bootstrap series conditionally on the first p
observations of the observed series as the initial values of the bootstrap replicates.
Both Li (2011) and Pan and Politis (2016) use forward BS in a SETAR forecasting
context. One simple algorithm to construct the FI for Yt+H is as follows.
1.1 Estimate the parameters of (10.48) conditional on $\widehat{r}_T$. Compute the EDF, say $\widehat{F}_\varepsilon$, of the mean-deleted residuals $\{\widetilde{\varepsilon}_t = \widehat{\varepsilon}_t - \bar{\varepsilon}\}_{t=p+1}^{T}$, where $\bar{\varepsilon} = (T-p)^{-1}\sum_{t=p+1}^{T}\widehat{\varepsilon}_t$ and

$$\widehat{\varepsilon}_t = Y_t - \Big\{\widehat{\phi}_0^{(1)} + \sum_{i=1}^{p}\widehat{\phi}_i^{(1)}Y_{t-i}\Big\}I(Y_{t-d} \le \widehat{r}_T) - \Big\{\widehat{\phi}_0^{(2)} + \sum_{i=1}^{p}\widehat{\phi}_i^{(2)}Y_{t-i}\Big\}I(Y_{t-d} > \widehat{r}_T).$$
1.2 Draw (with replacement) BS pseudo-residuals $\{\varepsilon_t^*\}$ from $\widehat{F}_\varepsilon$, and generate the BS replicate of $Y_t$, denoted by $Y_t^*$, as $Y_t^* = Y_t$ (t = 1, . . . , p) and

$$Y_t^* = \Big\{\widehat{\phi}_0^{(1)} + \sum_{i=1}^{p}\widehat{\phi}_i^{(1)}Y_{t-i}^*\Big\}I(Y_{t-d}^* \le \widehat{r}_T) + \Big\{\widehat{\phi}_0^{(2)} + \sum_{i=1}^{p}\widehat{\phi}_i^{(2)}Y_{t-i}^*\Big\}I(Y_{t-d}^* > \widehat{r}_T) + \varepsilon_t^*, \qquad (t = p+1, \ldots, T).$$
1.3 Based on the pseudo-data $\{Y_t^*\}_{t=1}^{T}$, and using $\widehat{r}_T$, re-estimate the coefficients $\phi_i^{(j)}$, obtaining a new set of BS coefficients $\widehat{\phi}_i^{*,(j)}$.
1.4 Compute the BS forecast $Y^*_{t+H}$ recursively from the re-estimated model, where $\varepsilon^*_{t+H}$ is a random draw (with replacement) from $\widehat{F}_\varepsilon$. So the BS forecasts are all conditioned on the forecast-origin data.
1.5 Repeat steps 1.1 – 1.4 B times, and obtain the BS forecasts $\{Y^{*,(b)}_{t+H}\}_{b=1}^{B}$.
Note that Algorithm 10.1 ignores the sampling variability of rT . To adjust for
this, step 1.3 can be repeated many times with BS threshold values obtained from
Algorithm 6.2; see Li (2011). Another modification follows from using bias-corrected
estimators of the coefficients $\widehat{\phi}_i^{(j)}$; see, e.g., Kilian (1998). In the context of linear AR
models, Kim (2003) provides a BS mean bias-corrected estimator which can simply
be adopted to correct for biases of SETAR coefficient estimators. In particular,
Algorithm 10.1 needs to be modified as follows.
2.2 Re-estimate (10.48) using $\{Y_t^*\}_{t=1}^{T}$ and $\widehat{r}_T$, and obtain the BS coefficients $\widehat{\phi}_i^{*,(j)}$ (i = 0, . . . , p; j = 1, 2). Repeat this step C times to get a set of BS coefficients $\{\widehat{\phi}_i^{*,(c),(j)}\}_{c=1}^{C}$.
2.3 Compute the bias of $\widehat{\phi}_i^{(j)}$ as $\widehat{\mathrm{Bias}}(\widehat{\phi}_i^{(j)}) = \bar{\phi}_i^{*,(j)} - \widehat{\phi}_i^{(j)}$, where $\bar{\phi}_i^{*,(j)}$ is the average of the C BS coefficients.
2.5 Re-estimate (10.48) using $\{Y_t^{c*}\}_{t=1}^{T}$ and $\widehat{r}_T$, and obtain the BS coefficients $\widehat{\phi}_i^{*,(c),(j)}$. Next, compute the bias-corrected BS forecasts as $Y_t^{c*} = Y_t$ (t = T, T − 1, . . . , T − p + 1) and

$$Y^{c*}_{t+H} = \Big\{\widehat{\phi}_0^{*,(c),(1)} + \sum_{i=1}^{p}\widehat{\phi}_i^{*,(c),(1)}Y^{c*}_{t+H-i}\Big\}I(Y^{c*}_{t+H-d} \le \widehat{r}_T) + \Big\{\widehat{\phi}_0^{*,(c),(2)} + \sum_{i=1}^{p}\widehat{\phi}_i^{*,(c),(2)}Y^{c*}_{t+H-i}\Big\}I(Y^{c*}_{t+H-d} > \widehat{r}_T) + \varepsilon^*_{t+H}.$$
2.6 Repeat steps 2.1 – 2.4 B times and obtain a set of bias-corrected forecasts $\{Y^{c*,(b)}_{t+H}\}_{b=1}^{B}$. The bias-corrected BFI (BFI$^c$) with coverage probability (1 − α) is then given by the corresponding percentile interval.
Note that the bias-correction in step 2.3 can push the coefficients into the non-
stationary region of the parameter space; see, e.g. Clements (2005, Section 4.2.4),
Kilian (1998), and Li (2011) for a stationarity correction procedure which can easily
be implemented in Algorithm 10.2. Another modification is to replace the fitted
residuals by predictive residuals (Politis, 2013, 2015). For a SETAR(2; p, p) model
these residuals can be computed as follows. Delete the row $(1, Y_{t-1}, \ldots, Y_{t-p})$ from the T × (p + 1) design matrix $X_t(r)$ (see (6.11)), and delete $Y_t$ from the series $\{Y_t\}_{t=1}^{T}$. Next, compute the leave-one-out CLS estimator of the model coefficients using (6.11), and obtain the leave-one-out fitted value $\widehat{Y}_t^{-t}$. Then the predictive residuals are given by $\widehat{\varepsilon}_t^{\,-t} = Y_t - \widehat{Y}_t^{-t}$. The key idea here is that the distribution of the one-step ahead forecast errors can be approximated better by the EDF of $\{\widehat{\varepsilon}_t^{\,-t}\}_{t=p+1}^{T}$ than by the EDF of $\{\widehat{\varepsilon}_t\}_{t=p+1}^{T}$; cf. Exercise 10.7.
and α = 0.05. To assess the performance of the BFIs, we use the empirical coverage rate (CVR) defined by

$$\mathrm{CVR}_{H,\alpha} = \frac{1}{m}\sum_{i=1}^{m} I\big(Y_{i,T+H} \in \mathrm{FI}^{(\cdot)}_{\alpha}\big), \qquad (10.53)$$

where $Y_{i,T+H}$ denotes the H-step ahead value, forecast at time t = T, from the ith data set, and $\mathrm{FI}^{(\cdot)}_{\alpha}$ denotes either $\mathrm{BFI}_{H,\alpha}$ or $\mathrm{BFI}^{c}_{H,\alpha}$.
Figure 10.5 shows boxplots of the CVRH,α for H = 1, . . . , 5 and m = 100.
There are no serious size distortions in coverage rates; both BFIs have an IQR
of about 0.03, on average, across all values of H. This implies that the BFIs
generally work well. The variability of the threshold variable estimator does
not seem to cause higher CVRs in the case of BFI cH,α . Moreover, the CVRs
seem to remain fairly constant as H increases with average standard deviation
Figure 10.5: Empirical CVRs for (a) BFIH,α and (b) BFIcH,α for the SETAR(2; 1, 1)
model (10.52); T = 100, α = 0.05, B = 1,000, m = 100, and 500 MC replications.
of about 0.02 in both cases (a) and (b). Observe that (10.53) represents an
unconditional coverage probability since YT is different for each simulated data
set.
The so-called shortest conditional modal interval (SCMI) with coverage probability 1 − α is defined as

$$\mathrm{SCMI}_\alpha(x) = \big[\,m_{\alpha/2}(x) - b_{\alpha/2}(x),\; m_{1-\alpha/2}(x) + b_{1-\alpha/2}(x)\,\big], \qquad \alpha \in [0,1]. \qquad (10.56)$$
³ The property of variable-size FIs is commonly named sharpness or resolution. Sometimes a subtle difference is made between both terms, in the sense that sharpness relates to the average size of FIs and resolution to their associated variability; cf. Exercise 10.8.
It follows from (10.54) and (10.55) that the SCMI can also be defined as

$$[a, b] = \arg\min\big\{\mathrm{Leb}\{[c, d]\}\;\big|\;F(d|x) - F(c|x) \ge 1 - \alpha\big\}, \qquad \alpha \in [0,1], \qquad (10.57)$$

where Leb(C) denotes the Lebesgue measure of the set C, which is a measurable subset of $\mathbb{R}^p$, and F(·|x) the conditional distribution function of $Y_t$ given $X_t = x$. Thus, the idea is to search for the set with the minimum length among all predictive sets; see Fan and Yao (2003, Section 10.4) for a more thorough discussion.
Of course, in practice, a natural estimator for the SCMI is obtained by replacing
F (·|x) by a consistent estimate, e.g. the NW or the LL kernel-based estimator. For
symmetric and unimodal conditional predictive distributions SCMI reduces to CPI.
We call the subset $R_\alpha$ the 100(1 − α)% HDR of f(·|x) (cf. Hyndman, 1995, 1996), such that $R_\alpha = \{y: f(y|x) \ge f_\alpha\}$, where $f_\alpha$ is the largest constant for which the coverage probability is at least 1 − α. Thus, the HDR is naturally related to the conditional mode since they are both based on points of highest density. The HDR can be equivalently defined as

$$\bigcup_{i=1}^{\ell}[a_i, b_i] = \arg\min\Big\{\mathrm{Leb}\Big(\bigcup_{i=1}^{\ell}[c_i, d_i]\Big)\;\Big|\;c_1 < d_1 \le c_2 < d_2 \le \cdots \le c_\ell < d_\ell \text{ and } \sum_{i=1}^{\ell}\big\{F(d_i|x) - F(c_i|x)\big\} \ge 1 - \alpha\Big\},$$

where ℓ ≥ 1 denotes the number of sub-intervals. Replacing F(·|x) by, for instance, the NW smoother gives an estimator of the HDR. By definition, the HDR is of the smallest Lebesgue measure among all FRs with the same α. The HDR may consist of fewer than ℓ disconnected intervals even though f(·|x) has ℓ modes. Like the SCMI, the HDR reduces to the CPI when f(·|x) is unimodal and symmetric with respect to its mode.
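Given a conditional density evaluated on an equally spaced grid, the HDR follows by retaining the highest-density grid points until the target coverage is reached. A short R sketch:

## 100(1-alpha)% HDR from density values f on an equally spaced grid x.
hdr.grid <- function(x, f, alpha = 0.1) {
  dx <- x[2] - x[1]
  o  <- order(f, decreasing = TRUE)       # highest-density points first
  keep <- o[cumsum(f[o]) * dx <= (1 - alpha)]
  sort(x[keep])                           # grid points inside the HDR
}
x <- seq(-5, 5, by = 0.01)
f <- 0.5 * dnorm(x, -2) + 0.5 * dnorm(x, 2)   # bimodal predictive density
rng <- hdr.grid(x, f, alpha = 0.1)
range(rng)                                # overall envelope of the HDR

For a bimodal density such as the one above, the disjoint sub-intervals of the HDR can be recovered from the gaps in rng where consecutive grid points differ by more than one grid step.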
Figure 10.6: Hourly river flow data set. (a) One-step ahead forecast $\widehat{Y}^{Mdn}_{t+1|t}$ and estimated SCMIs (with coverage probability 0.9); (b) One-step ahead forecast $\widehat{Y}^{Mdn}_{t+1|t}$ and estimated HDRs (with the highest coverage probability). From De Gooijer and Gannoun (2000).
at the tth hour, $Y_t$, from the observed values of $Y_{t-1}$ using the nonparametric predictor $\widehat{Y}^{Mdn}_{t+H|t}$ defined in (9.7). We use a Gaussian kernel, and set p = 1. As a starting point we select t = 366, which is just located before the large peak in $\{Y_t\}$ at time t = 374. Next, we predict $Y_{368}$ using the observed values up to and including the one at t = 367. This procedure is repeated till the end of the series. Hence, in total 35 one-step ahead predictions are available. Further, with coverage probability (1 − α) = 0.9, we estimate the SCMI and the HDR in each step. The bandwidths follow from minimizing $\mathrm{CV}^{Mdn}(H)$; see Table 9.1.
Figures 10.6(a) and 10.6(b) show plots of the last 35 observations of $\{Y_t\}$ with one-step ahead forecasts $\widehat{Y}^{Mdn}_{t+1|t}$ for the SCMI and the HDR. Clearly, the SCMI is very wide and asymmetric whereas the HDR is much tighter. Note, however, that at t = 370–375, 378–380, 382–383, 391, and 400 the realizations do not fall within the HDR. On the other hand, the SCMI does not cover the corresponding observed values at t = 371–372, 374–375, 382, and 385. Similar observations were noted for FRs based on $\widehat{Y}^{Mean}_{t+H|t}$, defined in (9.5). For the time period t = 370–375 this is caused by a steep rise in river flow following heavy rainfall (3.2 mm/hour at t = 374). Thus, the width of both FRs can be quite sensitive to the position in the state space from which predictions are being made.
same quantity are available via rival forecast methodologies. Then the question naturally arises how likely it is that differences between the two forecasts are due to chance or whether they are “significant”. Below we review various tests for comparing the accuracy of competing point forecasts. First, we describe the basic forecast setup.
Setup
Let $\{Y_t\}_{t=1}^{T+H}$ be the sample of observations, where $H \equiv H_{max} \ge 1$ denotes the longest forecast horizon of interest. We assume that the available data set is divided into in-sample and out-of-sample portions, with R (R as in Regress) the total number of in-sample observations and P the number of H-step ahead forecasts. Thus, R + P +
H − 1 ≡ T + H is the size of the available sample. Note that this setup implies that
P out-of-sample forecasts depend on the same parameter vector estimated on the
first R observations. So, the forecast scheme is based on a single, fixed, estimation
sample.
Alternatively, a rolling or a recursive forecasting scheme can be employed. In the latter case, the first forecast is based on a model with parameter vector estimated using $\{Y_t\}_{t=1}^{R}$, the second on a parameter vector estimated using $\{Y_t\}_{t=1}^{R+1}$, . . . , and the last on a parameter vector estimated using $\{Y_t\}_{t=1}^{R+P-1}$, where T ≡ R + P − 1. In the rolling scheme, the sequence of parameter estimates is always generated from a fixed, but rolling, sample of size R: the first forecast is based on parameter estimates obtained from the set of observations $\{Y_t\}_{t=1}^{R}$, the next on parameter estimates obtained from $\{Y_t\}_{t=2}^{R+1}$, and so on.
$$\sqrt{P}\,(\bar{d} - \mu_d) \overset{D}{\longrightarrow} N\big(0, \mathrm{Var}(d)\big), \qquad (10.60)$$
where

$$\mathrm{Var}(\bar{d}) \approx \frac{1}{P}\sum_{\ell=-(H-1)}^{H-1}\gamma_d(\ell), \qquad (10.61)$$

with $\gamma_d(\cdot)$ the ACVF of $\{d_t, t \in \mathbb{Z}\}$. The lag-ℓ autocovariance can be estimated by

$$\widehat{\gamma}_d(\ell) = \frac{1}{P}\sum_{t=R+H+\ell}^{R+P+H-1}(d_t - \bar{d})(d_{t-\ell} - \bar{d}), \qquad \ell \in \mathbb{Z}.$$

Then a consistent estimate $\widehat{\mathrm{Var}}(\bar{d})$ of $\mathrm{Var}(\bar{d})$ follows directly. The resulting asymptotic distribution of the DM test statistic is then

$$\mathrm{DM} = \frac{\bar{d}}{\sqrt{\widehat{\mathrm{Var}}(\bar{d})}} \overset{D}{\longrightarrow} N(0,1), \qquad \text{as } P \to \infty. \qquad (10.62)$$
Modified DM test
When the forecast errors are Gaussian distributed or fat-tailed, MC simulation results (Diebold and Mariano, 1995) indicate that the DM test statistic, under quadratic loss, is robust to contemporaneous and serial correlation in large samples, but the test is oversized in small samples. Indeed, for a small number of forecasts it is
⁴ It is good to mention that the null hypothesis of the Giacomini–White approach is different from that of West and his co-authors in two respects: (i) the loss function L(·) depends on estimates rather than their probability limits; and (ii) the expectation in (10.59) is conditional on some information set.
$$\mathrm{Var}(\bar{d}) = \frac{1}{P}\Big[\gamma_d(0) + 2P^{-1}\sum_{\ell=1}^{H-1}(P - \ell)\gamma_d(\ell)\Big]. \qquad (10.63)$$

Then $\widehat{\mathrm{Var}}(\bar{d})$ can be written as

$$\widehat{\mathrm{Var}}(\bar{d}) = \frac{1}{P}\Big[\gamma_d^*(0) + 2P^{-1}\sum_{\ell=1}^{H-1}(P - \ell)\gamma_d^*(\ell)\Big], \qquad (P \ge 2), \qquad (10.64)$$

where

$$\gamma_d^*(\ell) = \frac{1}{P - \ell}\sum_{t=R+H+\ell}^{R+P+H-1}(d_t - \bar{d})(d_{t-\ell} - \bar{d}).$$

Assume the mean of $\{d_t, t \in \mathbb{Z}\}$ is known and, without loss of generality, can be taken to be zero. With a little algebra (cf. Exercise 10.5), it then follows that

$$E\big[\widehat{\mathrm{Var}}(\bar{d})\big] \approx \frac{P + 1 - 2H + P^{-1}H(H-1)}{P}\,\mathrm{Var}(\bar{d}). \qquad (10.66)$$

The term $P^{-1}H(H-1)$ is included here, since (10.66) is exact in the special case where the process $\{d_t, t \in \mathbb{Z}\}$ is WN.
As an implication of (10.66), the DM test statistic can be modified (m) for its finite-sample oversizing by using an approximately unbiased variance estimate, say $\widehat{\mathrm{Var}}_m(\bar{d})$. The resulting MDM test statistic is therefore simply

$$\mathrm{MDM} = \frac{\bar{d}}{\sqrt{\widehat{\mathrm{Var}}_m(\bar{d})}} = \Big\{\frac{P + 1 - 2H + P^{-1}H(H-1)}{P}\Big\}^{1/2}\mathrm{DM}, \qquad (10.67)$$

where $\widehat{\mathrm{Var}}_m(\bar{d}) = \big\{(P + 1 - 2H + P^{-1}H(H-1))/P\big\}^{-1}\,\widehat{\mathrm{Var}}(\bar{d})$.
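A minimal R sketch of the DM and MDM statistics under quadratic loss, computed from two series of H-step ahead forecast errors; the autocovariance uses the common 1/P convention, a slight simplification of (10.61)–(10.64):

## DM statistic (10.62) and small-sample modification MDM (10.67).
dm.stats <- function(e1, e2, H = 1) {
  d <- e1^2 - e2^2                       # loss differential
  P <- length(d); dbar <- mean(d)
  gam <- function(l) sum((d[(l + 1):P] - dbar) * (d[1:(P - l)] - dbar)) / P
  Vd <- (gam(0) + if (H > 1) 2 * sum(sapply(1:(H - 1), gam)) else 0) / P
  DM  <- dbar / sqrt(Vd)
  MDM <- sqrt((P + 1 - 2 * H + H * (H - 1) / P) / P) * DM
  c(DM = DM, MDM = MDM)                  # compare with N(0,1) quantiles
}
set.seed(2)
dm.stats(rnorm(100), rnorm(100, sd = 1.1), H = 2)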
Define the hit function as

$$i_t^{(\alpha)} = \begin{cases} 1 & \text{if } Y_{t+H} \in [L^{(1-\alpha)}_{t+H|t},\, U^{(1-\alpha)}_{t+H|t}], \\ 0 & \text{otherwise}, \end{cases} \qquad (10.68)$$
where P denotes the number of H-step ahead forecasts, and R the total number of in-
sample observations. Thus, the indicator (or “hit”) function tells whether the actual
value Yt+H lies (a “hit”) or does not lie (a “miss” or a “violation”) in the FI for that
lead time H. The sequence of interval forecasts is said to be “well-specified” with respect to the past information set $\Psi_t = \{i_t^{(\alpha)}, i_{t-1}^{(\alpha)}, \ldots\}$ if $E(i_t^{(\alpha)}|\Psi_{t-1}) = 1 - \alpha \equiv p$.
Within this framework, Christoffersen (1998) proposes the following, widely used,
LR-based test statistics.
Let

$$n_1 = \#\{i_t^{(\alpha)} = 1\} = \sum_{t=1}^{P} i_t^{(\alpha)} \quad\text{and}\quad n_0 = \#\{i_t^{(\alpha)} = 0\} = P - n_1.$$

The likelihoods of the data under the null and alternative hypotheses are, respectively,

$$L_p \equiv L(p;\, i_1^{(\alpha)}, \ldots, i_P^{(\alpha)}) = (1-p)^{n_0}p^{n_1} \quad\text{and}\quad L_{\widehat{\pi}} \equiv L(\widehat{\pi};\, i_1^{(\alpha)}, \ldots, i_P^{(\alpha)}) = (1-\widehat{\pi})^{n_0}\widehat{\pi}^{n_1},$$

with $\widehat{\pi} = n_1/P$.
He further suggests testing for independence by modeling the process $\{i_t^{(\alpha)}, t \in \mathbb{Z}\}$ as a two-state (i.e., k = 2 in the notation of Section 2.10) first-order Markov chain with transition probability matrix

$$P_1 = \begin{pmatrix} 1 - p_{12} & p_{12} \\ 1 - p_{22} & p_{22} \end{pmatrix}, \qquad (10.70)$$

where $p_{ij} = P(i_t^{(\alpha)} = j\,|\,i_{t-1}^{(\alpha)} = i)$ and $\sum_{j=1}^{2}p_{ij} = 1$ (i, j = 1, 2). Let $n_{ij}$ denote the number of events in which state i is followed by state j. Then the approximate likelihood function under the alternative hypothesis for the whole process is

$$L_{\widehat{P}_1} = \prod_{i=1}^{2}\prod_{j=1}^{2}\widehat{p}_{ij}^{\;n_{ij}}, \qquad (10.71)$$

with $\widehat{p}_{ij} = n_{ij}/(n_{i1} + n_{i2})$ (i, j = 1, 2) the ML estimate of $p_{ij}$. Under the null
hypothesis $H_0^{(ind)}: p_{12} = p_{22}$, the state of the process at time t conveys no information on the relative likelihood of being in one state as opposed to another at time t + 1. Thus, when the outcome, say $i_t^{(\alpha)}$, of the chain lies in state j, the nearest outcome $i_{t-1}^{(\alpha)}$ has the same probability of lying in any state. We can write this as $p_{1j} = p_{2j} = \pi_j$, where $\pi_j = P(i_t^{(\alpha)} = j)$ (j = 1, 2). Let $n_j$ denote the corresponding number of outcomes. Then the ML estimate of $\pi_j$ is given by $\widehat{\pi}_j = n_j/N$ with $N = \sum_{i,j=1}^{2}n_{ij}$. Hence, the approximate likelihood function under $H_0^{(ind)}$ is $L_{\widehat{P}_0} \equiv L(\widehat{P}_0;\, i_1^{(\alpha)}, \ldots, i_P^{(\alpha)}) = \prod_{j=1}^{2}(n_j/N)^{n_j}$, and the unrestricted likelihood function is $L_{\widehat{P}_1} \equiv L(\widehat{P}_1;\, i_1^{(\alpha)}, \ldots, i_P^{(\alpha)}) = \prod_{i=1}^{2}\prod_{j=1}^{2}\big(n_{ij}\big/\sum_{j=1}^{2}n_{ij}\big)^{n_{ij}}$. Then the LR-based test statistic for independence is

$$\mathrm{LR}_{ind} = -2\log\big(L_{\widehat{P}_0}/L_{\widehat{P}_1}\big). \qquad (10.72)$$
Under $H_0^{(ind)}$, and as P → ∞, $\mathrm{LR}_{ind}$ has a $\chi^2_{(2-1)^2}$ distribution. Similarly, it is straightforward to show that for a k-state (k ≥ 2) first-order Markov chain, the corresponding LR-based test statistic has (asymptotically) a $\chi^2_{(k-1)^2}$ distribution under the null hypothesis.
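A minimal R sketch of the LR_uc, LR_ind, and LR_cc statistics computed from a 0/1 hit sequence, with transition counts formed directly and the convention 0·log 0 = 0:

## Christoffersen-type LR tests for a sequence of interval-forecast hits.
christoffersen <- function(hit, p = 0.95) {
  xlg <- function(x, y) if (x == 0) 0 else x * log(y)
  P <- length(hit); n1 <- sum(hit); n0 <- P - n1; pihat <- n1 / P
  LRuc <- -2 * (xlg(n0, 1 - p) + xlg(n1, p) -
                xlg(n0, 1 - pihat) - xlg(n1, pihat))
  a <- hit[-P]; b <- hit[-1]               # consecutive pairs
  n00 <- sum(a == 0 & b == 0); n01 <- sum(a == 0 & b == 1)
  n10 <- sum(a == 1 & b == 0); n11 <- sum(a == 1 & b == 1)
  p01 <- n01 / (n00 + n01); p11 <- n11 / (n10 + n11); pi1 <- (n01 + n11) / (P - 1)
  ll1 <- xlg(n00, 1 - p01) + xlg(n01, p01) + xlg(n10, 1 - p11) + xlg(n11, p11)
  ll0 <- xlg(n00 + n10, 1 - pi1) + xlg(n01 + n11, pi1)
  LRind <- -2 * (ll0 - ll1)
  c(LRuc = LRuc, LRind = LRind, LRcc = LRuc + LRind)   # chisq df: 1, 1, 2
}
set.seed(3)
christoffersen(rbinom(250, 1, 0.95), p = 0.95)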
Under $H_0^{(cc)}$ it follows (Christoffersen, 1998) that, as P → ∞, the test statistic $\mathrm{LR}_{cc}$ has a $\chi^2_2$ distribution. For a k-state (k ≥ 2) first-order Markov chain, the corresponding
of $\{D_j\}_{j=1}^{N}$. Then a hit function is said to have a tendency to clustering of violations if $\mathrm{Mdn}(D_{N:N}/D_{[N/2]:N})$ is higher than the median of the process $\{D_j, j \in \mathbb{Z}^+\}$ under $H_0^{(ind)}$.
Next, as a special case of the proposed class of independence tests, Araújo Santos and Fraga Alves (2012) define the test statistic

$$T_{N,[N/2]} = \frac{D_{N:N}-1}{D_{[N/2]:N}}\,\log 2 - \log N. \qquad (10.74)$$
The test statistic is pivotal in the sense that its distribution does not depend on an unknown parameter. However, (10.74) is a test statistic for $H_0^{(ind)}$, not for testing $H_0^{(cc)}$. The decision rule for rejecting $H_0^{(ind)}$ can be based on critical values (using an exact distribution) provided by Araújo Santos and Fraga Alves (2012, Appendix) or by simulating p-values (cf. Exercise 10.11).
6
Within the Value-at-Risk (VaR) evaluation literature of FIs these test statistics are often called
backtesting procedures.
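The following R sketch computes (10.74) from a vector of durations between consecutive violations and simulates a p-value under independent violations, in which case the durations are i.i.d. geometric; the function names, the choice p = 0.1, and the upper-tail rejection rule are illustrative (cf. Exercise 10.11).

```r
# Minimal sketch of the T_{N,[N/2]} statistic (10.74) computed from the
# durations D_j between consecutive violations.
TN.stat <- function(D) {
  N  <- length(D)
  Ds <- sort(D)
  (Ds[N] - 1) / Ds[floor(N / 2)] * log(2) - log(N)
}
# Simulated p-value under independent violations: durations are then i.i.d.
# geometric; p = 0.1 (a 90% FI) is an illustrative choice.
sim.pvalue <- function(Tobs, N, p = 0.1, nrep = 25000) {
  Tnull <- replicate(nrep, TN.stat(rgeom(N, p) + 1))
  mean(Tnull >= Tobs)   # large values of T point to clustering of violations
}
```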
Under the null hypothesis ($H_0$) that the model forecast density corresponds to the true conditional density implied by the DGP, denoted by $f_t(\cdot|\mathcal{F}_{t-1})$, that is $\widehat f_t(\cdot|\mathcal{F}_{t-1}) = f_t(\cdot|\mathcal{F}_{t-1})$, the process $\{U_t, t \in \mathbb{Z}\}$ of probability integral transforms (PITs) is i.i.d. U(0, 1) distributed (Rosenblatt, 1952).
A simple way of testing the uniformity part of the null hypothesis conditional on
the i.i.d. assumption is by using a nonparametric GOF test like the KS, AD or CvM
test statistics; see, e.g., Chapter 7. Alternatively, a plot of the CDF of the $U_t$ may be used and visually compared with a line at an angle of 45° representing the cumulative uniform distribution. The independence part of the null hypothesis may be tested by using an LM-type test for serial correlation in the sequences $\{(U_t-\bar U)^u\}_{t=1}^{P}$ (u = 1, 2), where $\bar U$ is the sample mean of the $U_t$. For the case u = 2, the sample ACF may indicate some form of nonlinear dependence such as heteroskedasticity. Similar evaluation techniques can be applied to the transformed sequence $\{\Phi^{-1}(U_t)\}_{t=1}^{P}$ which is i.i.d. N(0, 1) distributed under the null hypothesis (Berkowitz, 2001). Other
ways of testing forecast densities are given in the next chapter, albeit in a vector
nonlinear time series framework.
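As an illustration, the following R sketch collects the uniformity and independence checks described above, assuming the vector u holds the P PITs; the function name is illustrative, and a full Berkowitz test would fit an AR(1) model to the transformed series and use an LR statistic rather than the simple lag-one correlations reported here.

```r
# Minimal sketch of density forecast evaluation via the PIT, assuming `u`
# holds the P probability integral transforms U_t.
evaluate.pit <- function(u) {
  ks <- ks.test(u, "punif")       # uniformity (conditional on the i.i.d. part)
  z  <- qnorm(u)                  # Berkowitz (2001) transform: i.i.d. N(0,1) under H0
  # simple proxies for an LM-type check in levels and squares
  r1 <- cor(z[-1], z[-length(z)])
  r2 <- cor(z[-1]^2, z[-length(z)]^2)
  sw <- shapiro.test(z)           # normality of the transformed PITs
  list(KS = ks$p.value, Shapiro = sw$p.value,
       lag1.corr = r1, lag1.corr.sq = r2)
}
```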
Recall the monthly ENSO time series discussed in Examples 1.4, 5.1, and 6.4. We proceed by evaluating the out-of-sample forecast performance of the nonlinear LSTEC model (6.24) as opposed to its linear (AR-type) counterpart (6.25) using a rolling forecasting approach. In (6.24) an LSTEC model was fitted to $\{\Delta Y_t\}_{t=1}^{468}$, covering the time period January 1952 – December 1990.
This period will serve as the first in-sample set. The last estimation window
ends with December 2008 (T = 684). Hence, in total, we estimate 216 linear
and nonlinear models on a monthly basis while, following Ubilava and Helmers (2013), the AR order p and the delay lag d of the transition variable are re-examined on an annual basis with d = 1, . . . , 6 and p = 1, . . . , 24 as possible candidate values. We set Hmax = 36 (months).

Figure 10.7: Predictive probabilities of ENSO events, using information up to and including June 1997. (a) Linear ECM (6.25), (b) LSTEC model (6.24), and (c) actual realization.
Genuine out-of-sample forecasts are obtained via a block bootstrap approach
to mitigate for the effects of potential residual autocorrelation and heteroske-
dasticity, and we fix the number of BS replicates at 1,000.
To assess the accuracy of the fitted time series models in forecasting El Niño and La Niña events, we introduce five threshold windows: SST ≤ −0.9°C ("Extreme" La Niña), −0.9°C < SST ≤ −0.5°C ("Moderate" La Niña), −0.5°C < SST < 0.5°C (Normal conditions), 0.5°C < SST < 0.9°C ("Moderate" El Niño), and SST ≥ 0.9°C ("Extreme" El Niño). For each window, and
each forecast horizon, we compile probability forecasts of ENSO events using
empirical forecast densities.
Figure 10.7 shows probability forecasts using information up to and including
June 1997 – when ENSO conditions are normal. For short-term (3 months ahead) forecasting, both linear and nonlinear models yield comparable results. Note, the overall picture changes for 6 – 12 months ahead, when the LSTEC model forecasts the upcoming extreme El Niño episode with about twice as large a probability as the linear model. In reality the 1997 – 1998 time
period showed the strongest El Niño event since 1950. Note that this period
was followed by a period of extreme La Niña, starting in the Fall of 1998 and
continuing into 1999 and 2000. Again, the LSTEC model is able to forecast the
beginning of this episode with a relatively high forecast accuracy (about 24%
probability) as compared to the linear model, which forecasts this up-coming
event with a modest 14% probability.
In addition, the DM test statistic rejects the null hypothesis of equality of
MSFEs for H = 1, . . . , 10, 14, 20, . . . , 27, 31, . . . , 37 with p-values < 0.03. For
H = 11, 12, 13, 28, 29, and 30 the DM test statistic indicates that there is
no statistically significant improvement in forecast accuracy of the nonlinear
model over the linear model. Moreover, for H = 15, . . . , 19 negative variance
estimates of $\bar d$ were obtained. Diebold and Mariano (1995) suggest that the
variance estimate should then be treated as zero and the null hypothesis of
equal forecast accuracy be rejected. All these results indicate a preference for
the LSTEC model in ENSO forecasting.
The above observation is further supported by Figure 10.8(a) displaying the
RMSFEs from both models, and by Figure 10.8(b) showing the percentage
correctly predicted ENSO events. As we see, up to H = 20 the LSTEC
model shows the largest improvement in forecast accuracy as measured by
the RMSFE. Figure 10.8(b) reveals that La Niña events are more accurately
predicted by the LSTEC model than El Niño events. In addition, the LSTEC
model is more effective in forecasting La Niña over a notably longer time
period.
10.5 FORECAST COMBINATION
follow from minimizing some loss function, usually the MSFE.8 However, in empirical applications equal-weighting (ew) often outperforms estimated optimal forecast combinations,9 i.e.

$$Y^{ew}_{t+H|t} = \frac{1}{n}\sum_{i=1}^{n} Y_{i,t+H|t}. \qquad (10.76)$$
Interval forecasts
FIs are frequently too narrow, i.e. too many observations are in the tails of the
forecast distribution; Chatfield (1993) discusses seven reasons for this problem oc-
curring. One most likely reason is that forecast errors are not normally distributed
because the underlying DGP is nonlinear. Granger (1989) suggests a simple method
to construct realistic, non-symmetrical FIs. The method combines the H-step ahead conditional quantile predictors $\{\widehat\xi_{i,q}(x)\}_{i=1}^{n}$, $q \in (0, 1)$, obtained from n different time series models with weights $\widehat w_{i,q}(x)$ based upon within-sample estimation. That is,

$$\widehat\xi^{\,C}_q(x) = \sum_{i=1}^{n}\widehat w_{i,q}(x)\,\widehat\xi_{i,q}(x), \qquad (10.77)$$

where the weights are chosen to minimize the (local linear) "check" function; see Section 9.1.2. If the conditional quantile estimators are unbiased, then we might expect that $\sum_{i=1}^{n}\widehat w_{i,q}(x) \approx 1$, and this constraint could be used for simplification, assuming the individual conditional quantile functions $\xi_{i,q}(x)$ are sufficiently smooth.
7
The weights may change through time; see Deutsch et al. (1994) for an example. Note that
the underlying DGP may or may not be second-order stationary.
8
If the component forecasts are biased, it is recommended (Granger and Ramanathan, 1984) to
add a constant to the combined forecasting model and not to constrain the weights to add to unity.
9
This is known as the forecast combination puzzle; see, e.g., Huang and Lee (2010), Smith and Wallis (2009), Aiolfi et al. (2011), and Claeskens et al. (2016), for some answers to this puzzle.
A combined conditional percentile interval then follows from (10.42); see Granger et
al. (1989b) for an application.
Density forecasts
Generalizing the notation in Section 10.4.3, we denote n sequences of P indi-
vidual one-step ahead forecast densities of a process {Yt , t ∈ Z} at some time t, as
{fi,t (Yt |F i,t−1 )}Pt=1 , where F i,t−1 represents the ith information set (i = 1, . . . , n).
Then, assuming the density forecasts are continuous, the combined density forecast
is defined as
$$f^{C}_t(Y_t) = \sum_{i=1}^{n} w_i f_{i,t}(Y_t|\mathcal{F}_{i,t-1}), \quad (t = 1,\ldots,P), \qquad (10.78)$$

with $w_i \ge 0$ and $\sum_{i=1}^{n}w_i = 1$.10 This combined density satisfies certain properties such as the "unanimity" property which amounts to saying that if all forecasters agree on the probability of a certain event then the combined probability agrees also. Further characteristics of $f^{C}_t(\cdot)$ can be drawn out by, for instance, defining the forecast mean $\mu_{i,t} = \int_{-\infty}^{\infty}y_t f_{i,t}(y_t|\mathcal{F}_{i,t-1})\,dy_t$ and variance $\sigma^2_{i,t} = \int_{-\infty}^{\infty}(y_t-\mu_{i,t})^2 f_{i,t}(y_t|\mathcal{F}_{i,t-1})\,dy_t$ of the ith density sequence at time t. The combined one-step ahead density has mean and variance

$$E\{f^{C}_t(Y_t)\} = \mu^{C}_t = \sum_{i=1}^{n}w_i\mu_{i,t}, \qquad \mathrm{Var}\{f^{C}_t(Y_t)\} = \sum_{i=1}^{n}w_i\sigma^2_{i,t} + \sum_{i=1}^{n}w_i(\mu_{i,t}-\mu^{C}_t)^2. \qquad (10.79)$$
The second equation of (10.79) indicates that the variance of the combined density equals the average individual uncertainty ("within" model variance) plus a measure of the dispersion of the individual forecasts ("between" model variance). This result
stands in contrast to the combined, optimal point forecast which has the smallest
MSFE within the particular set of individual point forecasts (cf. Exercise 10.6).
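A minimal R sketch of the moment formulas in (10.79) follows; the function name and the example values of the means and variances are illustrative.

```r
# Minimal sketch of (10.78)-(10.79): combining n forecast densities with
# weights w; mu and sig2 hold the individual forecast means and variances.
combine.density <- function(w, mu, sig2) {
  stopifnot(all(w >= 0), abs(sum(w) - 1) < 1e-8)
  muC  <- sum(w * mu)                            # combined mean
  varC <- sum(w * sig2) + sum(w * (mu - muC)^2)  # "within" + "between" variance
  c(mean = muC, var = varC)
}
# Example with equal weights (w_i = 1/n):
combine.density(rep(1/3, 3), mu = c(0.1, 0.3, -0.2), sig2 = c(1, 1.5, 0.8))
```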
Clearly, as before, the key issue is to find wi . Most simply, various authors (see,
e.g., Hendry and Clements, 2004) advocate the use of equal weights wi = 1/n. A
related topic is finding the set of weights in (10.78) that minimize the Kullback–
Leibler divergence (see (6.48)) between the combined density forecast and the true,
but unknown, conditional density ft (·|F t−1 ); see, among others, Bao et al. (2007)
and Hall and Mitchell (2007).
In this chapter, we discussed various exact and approximate methods for the generation of point forecasts. We
then described general methods for constructing forecast intervals and regions. We
also considered methods and test statistics for the evaluation of sequences of sub-
sequent point, interval, and density forecasts. Finally, we discussed some weighting
schemes for the optimal combination of model-based forecasts.
We would like to stress that this chapter introduced the major forecasting, eval-
uation and combination methods. As such, the chapter may well serve as a starting
point for anyone who intends to do empirical work. Table 10.2 can be helpful in
choosing an appropriate test statistic for forecast evaluation. Kock and Teräsvirta
(2011) provide additional literature on nonlinear forecasts (conditional means) of
economic time series obtained from parametric models, including NNs. Cheng et al.
(2015) summarize the “state-of-the-art” of forecasting models for complex (nonlinear
and nonstationary) biological, physical, and engineering dynamic systems.
Table 10.2: Overview of some forecast evaluation tests: Forecast errors are denoted by
ei,t ≡ ei,t+H|t (i = 1, 2), PEE = parameter estimation error, and HLN = Harvey, Leybourne,
and Newbold (1998). Based on Clark (2007).
Section 10.2.4: Lai and Zhu (1991) consider adaptive multi-step ahead MMSE predictors for NLAR models when the parameters are unknown, and provide a numerical comparison between their forecast method and the exact LS forecast $Y^{LS}_{t+H|t}$.
Section 10.2.5: Clements and Smith (1997) compare a number of alternative methods of
obtaining multi-step SETAR forecasts, including the NFE method. They conclude that the
MC method performs reasonably well. The BS forecast method is preferred when the errors
in the SETAR model come from a highly asymmetric distribution. Other comparisons
include Amendola and Niglio (2004), Brown and Mariano (1984), Clements and Krolzig
(1998), and Clements and Smith (1999, 2001). Niglio (2007) investigates forecasts from
SETARMA models under asymmetric (linex) loss.
Section 10.2.6: Linearization is often used by control engineers in filtering and nonlinear system analysis. Apart from the Taylor series expansion, there exist several other linearization methods of nonlinear state equations; see, e.g., Jordan (2006).
Section 10.2.7: The DE forecasting method was first introduced by Granger (1993, p. 132)
and called direct forecasting.
Section 10.3: Similar to the construction of kernel-type nonparametric BS confidence in-
tervals, nonparametric BFIs can also be based on pivotal statistics which are more conducive
for theoretical analysis. De Brabanter et al. (2005) construct such an interval. Moreover,
they provide an algorithm for the wild bootstrap.
The modal interval SCMI was originally proposed by Lientz (1970, 1972) for unconditional
distribution functions. Hyndman (1995, 1996) was the first to construct HDRs for un-
conditional densities. Yao and Tong (1995) and De Gooijer and Gannoun (2000) provide
applications of FRs and FIs with both real and simulated time series. Polonik and Yao (2000) establish various asymptotic properties of the conditional HDR, called minimum volume predictive region. The HDR estimation problem has been the focus of many papers; see, e.g., Samworth and Wand (2010) who study the asymptotic and optimal bandwidth
selection for nonparametric HDR estimation of a sequence of i.i.d. random variables.
Section 10.4.1: There is a myriad of theoretical papers dealing with extensions and modi-
fications of the DM test statistic; see, e.g., Harvey et al. (1997), Corradi et al. (2001),
Clements et al. (2003), Van Dijk and Franses (2003), and White (2000). West (2006) and
Corradi and Swanson (2012) provide surveys of the “state-of-the-art”. Two well-received
empirical studies dealing with forecast evaluation are by Swanson and White (1997a,b).
Recently, Diebold (2015) gives some personal reflections about the history of the DM test
statistic. The test was originally developed to compare the accuracy of model-free forecasts.
Mariano and Preve (2012) consider a multivariate version of the DM test statistic with
multiple forecasts and forecast errors from more than two alternative models.
Note that the section does not include nonparametric techniques. For instance, assuming that
the loss differentials are i.i.d., a standard sign test may be performed to test the null hy-
pothesis that the median of the loss-differential distribution is equal to zero. Alternatively,
Wilcoxon’s signed rank sum test for matched pairs can be used for this purpose. Also,
Pesaran and Timmermann (1992) propose a nonparametric test statistic for the null hy-
pothesis that there are no predictable relationships between the actual and predicted sign
changes of the predictand. Swanson and White (1997a,b), Chung and Zhou (1996) and
Jaditz and Sayers (1998) each construct nonparametric test statistics for out-of-sample fore-
casting.
Gneiting (2011) demonstrates that averaging individual point forecasts, summarized in meas-
ures such as the MAFE and MSFE, can lead to grossly misguided inferences, unless there
is a careful matching between the evaluation (loss) function and the forecasting task.
Section 10.4.2: There exists a large number of studies (see, e.g., Clements and Taylor,
2003; Engle and Manganelli, 2004; Berkowitz et al., 2011; Dumitrescu et al., 2013, and
the references therein) offering alternative approaches to testing for independence; see also
Campbell (2007) for a review.
Section 10.4.3: Diebold et al. (1998, 1999a,b) popularize the idea of using PITs in the
context of macro-econometrics; see Tay and Wallis (2000) for a survey. Wallis (2003) suggests another way of evaluating density forecasts. Mainly, it recasts the $\mathrm{LR}_{uc}$ and $\mathrm{LR}_{ind}$ test statistics into the framework of a Pearson $\chi^2$ test. Unfortunately, this approach lacks the
additivity property of the likelihoods. In fact, it is easy to see that the LR and Markov chain
based FI evaluation approach can be directly extended to the case of evaluating density
forecasts.
A number of empirical studies have shown that nonlinear models produce superior interval
and density forecasts (see, e.g., Clements and Smith, 2000; Ma and Wohar, 2014). Rapach
and Wohar (2006) compare out-of-sample point, interval and density forecasts generated by
the Band–TAR, ESTAR, and linear AR models. The quality (i.e. the statistical performance)
and the operational value of probabilistic forecasts is a primary requirement of many studies
of atmospheric variables. Within this context nonparametric evaluation methods play an
important role; see, e.g., Pinson et al. (2009) and the references therein.
Section 10.5: Since the seminal work of Bates and Granger (1969) a voluminous literature
has emerged on combining; see Timmermann (2006) for a recent review, and Granger (1989)
and Wallis (2011) for some extensions. One recent paper is Adhikari (2015) who proposes
a linear combination method for point forecasts that determines the combining weights
through a novel NN structure.
Software References
Section 10.1.1: FORTRAN77 code, written by Yuzhi Cai, to find the “exact” conditional
pdf of two-regime SETAR models and STAR models is available at the website of this book.
Section 10.2.1: The PI and LS SETARMA forecast results presented in Table 10.1 of
Example 10.2 are obtained by the LS-PI-forecast.r function, available at the website of this
book. The computer code was provided by Marcella Niglio, who also supplied the Linux-
Procedure.r function related to the generation of forecasts using the linex asymmetric loss
function.
Clements (2005, Chapter 8) contains sample GAUSS code for the estimation and forecasting
(MC method) of SETAR(2; 1, 1) models.
Section 10.3: The BS forecast intervals in Example 10.6 are computed using a RATS code
provided by Jing Li. A MATLAB function for computing BFIs is available at the website
of this book. The R-BootPR package provides a way to obtain BS bias-corrected coefficients
for forecasting linear AR models. The code can easily be adapted to SETAR-type models.
EXERCISES 431
The R-hdrcde package contains computer code for the calculation and plotting of HDRs.
GAUSS and MATLAB codes for computing the conditional mean, median, mode, SCMI
and HDR are available at the website of this book.
R codes for the estimation, forecasting, and out-of-sample evaluation of the ENSO series are
available in the file Example 6-4.zip.
Section 10.4.1: The MATLAB function dmtest retrieves the DM test statistic (under
quadratic loss) using the Newey–West (1987) estimator for the covariance matrix of the loss
differential. The R-forecast package contains the function dm.test. Some old R code for the
DM test statistic is available at the R-help forum: http://r.789695.n4.nabble.com/R-help-f789696.html.
The URL http://qed.econ.queensu.ca/jae/datasets/alquist001/ has MATLAB code
used in Alquist and Kilian (2010) to calculate the DM test statistic under both quadratic and
absolute loss, the Clark–West (2006) test statistic, and the Pesaran–Timmermann (1992)
test statistic.
Section 10.4.2: MATLAB code for computing the three LR-based test statistics is available
from http://www.runmycode.org/companion/view/93. GAUSS code for MC evaluation
of interval lengths and coverages is given by Clements (2005, Chapter 8).
Exercises
Theory Questions
10.1 Consider the strictly stationary NLAR(1) process

$$Y_t = \omega Y_{t-1}^{1/2} + \varepsilon_t,$$

where ω > 0, and $\{\varepsilon_t\} \stackrel{i.i.d.}{\sim} U(a, b)$ distributed with 0 ≤ a < b < ∞. Recall from Section 10.1.1 that the exact H-step ahead point forecast is given by $E(Y_{t+H}|Y_t) = f_H(Y_t)$ (H ≥ 1) using the short-hand notation $f_H(\cdot) = f_{Y_{t+H}|Y_t}(\cdot|x)$. Moreover, it is convenient to introduce the functions $g_0(x) = x$, $g_H(x) = \omega\{g_{H-1}(x)\}^{1/2} + \mu_\varepsilon$ for H ≥ 1. Then $Y^{Naive}_{t+H|t} = g_H(Y_t)$ is the naive H-step forecast of $Y_{t+H}$, i.e. an SK (skeleton) forecast with additive WN.
(a) Show that the exact three-step ahead LS forecast is given by

$$f_3(x) = \int_{-\infty}^{\infty} f_2(y)\,g\big(y-\mu(x)\big)\,dy = \frac{a+b}{2} + \frac{8}{105\,\omega(b-a)^2}\Big[Q(a,b,x)R(a,b,x) + Q(b,a,x)R(b,a,x) - Q(a,a,x)R(a,a,x) - Q(b,b,x)R(b,b,x)\Big],$$

where

$$Q(u,v,x) = \sqrt{u+\omega\sqrt{v+\omega\sqrt{x}}},$$
$$R(u,v,x) = 2u^3 - u^2\omega\sqrt{v+\omega\sqrt{x}} - 8u\omega^2\big(v+\omega\sqrt{x}\big) - 5\omega^3\big(v+\omega\sqrt{x}\big)^{3/2}.$$
(b) Let z ≥ 0 be a given number. Then the equation $x = \omega x^{1/2} + z$ has a unique positive root $x_z$. In particular, if z = 0, $x_0 = \omega^2$. Furthermore, $x_z$ is an increasing function of z. Define $\alpha = x_a$ and $\beta = x_b$. It is easy to verify that $Y_{t-1} \in [\alpha, \beta]$ implies $Y_t \in [\alpha, \beta]$ ∀t > 0. It can also be proved that for arbitrary $Y_0 \ge 0$ the process $\{Y_t, t \in \mathbb{Z}\}$, after a finite number of steps, falls with probability 1 into $[\alpha, \beta]$ and remains there.
Compute the functions $f_H(Y_t)$ and $g_H(Y_t)$ for H = 2 and 3, with $Y_t \in [\alpha, \beta]$ for the following two cases:
(i) ω = 1, a = 0, and b = 1;
(ii) ω = 1, a = 0, and b = 100.
Comment on the results.
(Anděl, 1997)
10.2 (a) Show that the exact two-step ahead MMSE forecast error variance is given by

$$\sigma^2_{e,2} = 2\sigma^2_\varepsilon\Phi(z_{t+1|t}) + \big\{\phi^2_1\Phi(z_{t+1|t}) + \phi^2_2\Phi(-z_{t+1|t})\big\}\big\{Y^2_{t+1|t} + \sigma^2_{e,t+1}\big\}$$

(c) Explore the limiting behavior of $\sigma^2_{e,2}$ as $Y_t \to \pm\infty$.
(De Gooijer and De Bruin, 1998)
10.3 Consider predicting from a stationary AR(1) process $Y_t = \phi Y_{t-1} + \varepsilon_t$ with $\{\varepsilon_t\} \stackrel{i.i.d.}{\sim} N(0, 1)$ when the true process actually is the SETAR(2; 0, 0) process in Example 10.1.
(d) Obtain a value for the AR(1) parameter φ by equating the lag 1 autocorrela-
tions of the AR(1) process and the SETAR(2; 0, 0) process for α = 1.5. Next,
using part (c), plot Ratio-MSFE(H) versus Yt ∈ [−5, 5] for H = 1, 2, 3, and 5.
Comment on the shape of the line plots.
(Guo and Tseng, 1997)
10.4 With reference to Section 2.8.1, we recall that the EAR(1) model is defined as

$$Y_t = \begin{cases}\alpha Y_{t-1} & \text{with prob. } \alpha,\\ \alpha Y_{t-1} + E_t & \text{with prob. } 1-\alpha,\end{cases}$$

where $\{E_t\}$ are i.i.d. exponentially distributed random variables with mean μ. Gaver and Lewis (1980) show that $Y_{t+j}$ can be expressed as
(a) Using (10.80), show that the MSFE(H) of the least squares (LS) forecast is given
by
(b) Show that the MAFE of the one-step ahead LS forecast, denoted by MAFE(1),
is given by
10.5 With reference to the point forecast evaluation measures in Section 10.4.1:
(b) Verify the statement below (10.66) about the exactness of $E\{\widehat{\mathrm{Var}}(\bar d)\}$ in the case the process $\{d_t, t \in \mathbb{Z}\}$ is WN.
10.6 Let $\{Y_t\}_{t=1}^{T}$ be an observed time series with T observations. Suppose that we have two unbiased one-step ahead forecasts $Y_{1,T+1|T}$ and $Y_{2,T+1|T}$, obtained from two different models for time t = T + 1. The corresponding forecast errors are $e_{i,T+1|T} = Y_{T+1} - Y_{i,T+1|T}$ (i = 1, 2). The one-step ahead forecast errors have variances $\sigma^2_{1,e}$ and $\sigma^2_{2,e}$ with $\sigma^2_{2,e} \le \sigma^2_{1,e}$. The covariance between $e_{1,T+1|T}$ and $e_{2,T+1|T}$ is equal to $\sigma_{12}$. Consider the combined forecast $Y^{C}_{T+1|T} = wY_{1,T+1|T} + (1-w)Y_{2,T+1|T}$ for some weight w. The corresponding forecast error is $e^{C}_{T+1|T} = Y_{T+1} - Y^{C}_{T+1|T}$.

(a) Show that $\mathrm{Var}(e^{C}_{T+1|T})$ is minimal for $w = w^*$, with

$$w^* = \frac{\sigma^2_{2,e}-\sigma_{12}}{\sigma^2_{1,e}+\sigma^2_{2,e}-2\sigma_{12}}.$$
(b) Let $\sigma^2_C(w^*)$ denote the variance of the combined forecast error evaluated at $w^*$. Show that $\sigma^2_C(w^*) \le \sigma^2_{2,e}$, and thus $\sigma^2_C(w^*) \le \sigma^2_{1,e}$.
(c) How does the optimal weight $w^*$, obtained via the combined forecast $Y^{C}_{T+1|T}$, behave as a function of the correlation $\rho_{12} = \sigma_{12}/(\sigma_{1,e}\sigma_{2,e})$ using values $\rho_{12} = 0$ and $\rho_{12} = \pm 1$?
(d) In practice, the variances $\sigma^2_{1,e}$, $\sigma^2_{2,e}$ are unknown. Also, the covariance $\sigma_{12}$ is unknown. How would you suggest estimating the optimal weight $w^*$?
10.7 Let $\{Y_t\}_{t=1}^{T}$ be a time series satisfying the above model, where $\{\varepsilon_t\} \stackrel{i.i.d.}{\sim} N(0, 1)$. Suppose that the ith observation (2 ≤ i ≤ T − 1) is missing from the series, but that $\{Y_t\}_{t=1}^{i-1}$ and $\{Y_t\}_{t=i+1}^{T}$ are known. Let $Y^{-i}$ denote the vector of known observations, and θ the vector of unknown parameters. Then show that the best minimum MSE (MMSE) forecast for $Y_i$ is given by
where, for j = 1, 2,

$$c_1^{(j)} = 1\big/\sqrt{2\pi(\sigma_j^2+\phi_j^2\sigma_{(j)}^2)},$$
$$c_2^{(j)} = \big(\phi_{(j)}\sigma_j^2 Y_{i-1} + \phi_j\sigma_{(j)}^2 Y_{i+1}\big)\big/\big(\sigma_j^2+\phi_j^2\sigma_{(j)}^2\big),$$
$$c_3^{(j)} = \sigma_j\sigma_{(j)}\big/\sqrt{2\pi(\sigma_j^2+\phi_j^2\sigma_{(j)}^2)},$$
$$c_4^{(j)} = \exp\big\{-\big(\phi_{(j)}\sigma_j^2 Y_{i-1}+\phi_j\sigma_{(j)}^2 Y_{i+1}\big)^2\big/2\sigma_j^2\sigma_{(j)}^2\big(\sigma_j^2+\phi_j^2\sigma_{(j)}^2\big)\big\},$$
$$c_5^{(j)} = \exp\big\{-\big(Y_{i+1}-\phi_j\phi_{(j)}Y_{i-1}\big)^2\big/2\big(\sigma_j^2+\phi_j^2\sigma_{(j)}^2\big)\big\},$$

with
10.8 Reconsider the SETAR(2; 1, 1) process in Exercise 10.1. Figures 10.9(a) and 10.9(b)
show the exact two-step ahead forecast function and the exact two-step ahead forecast
variance functions for two SETAR(2; 1, 1) processes each having a threshold at r = −2.
(a) Construct a tree diagram of all possible paths from Yt to Yt+2 . Explain qualit-
atively the maxima in the two variance functions.
Figure 10.9: Two-step ahead forecast function (with $\pm 2\sigma_{e,2}$) and variance function (blue solid lines) for the SETAR(2; 1, 1) process in Exercise 10.1 with (a) $(\phi_1, \phi_2) = (0.8, -0.4)$, and (b) $(\phi_1, \phi_2) = (-0.8, 0.4)$; r = −2 (black solid vertical line) and $\sigma^2_\varepsilon = 1$.
(b) Consider the SETAR process in Figure 10.9(a). Locate the two maxima of the two-step ahead forecast variance function by solving $d\sigma^2_{e,2}/du = 0$ numerically.
10.9 (a) Verify (10.26). In addition, using this result, prove that
(b) Suppose that ei,j represents the forecasting error for the jth-step ahead in the
ith replication (i = 1, . . . , 50; j = 1, . . . , 30). Analyze and compare the two
forecasting methods in terms of short-term (H = 5), medium-term (H = 15),
and long-term (H = 30) forecasting accuracy via the measures
$$\mathrm{MSFE}(H) = \frac{1}{50}\sum_{i=1}^{50}\frac{1}{H}\sum_{j=1}^{H}e^2_{i,j}, \quad\text{and}\quad \mathrm{MAFE}(H) = \frac{1}{50}\sum_{i=1}^{50}\frac{1}{H}\sum_{j=1}^{H}|e_{i,j}|.$$
10.10 Consider the SETAR(2; 1, 1) model in Example 10.6. In addition to the empirical
coverage rate (CVR) given by (10.53), two other measures for evaluating the sharpness
and resolution are the size and standard error of the length of the FIs, i.e. $Y^{(1-\alpha/2)}_{T+H,i} - Y^{(\alpha/2)}_{T+H,i}$, where $Y^{(\alpha/2)}_{T+H,i}$ and $Y^{(1-\alpha/2)}_{T+H,i}$ are based on m MC replications.

FI(0.95)                       {εt} ~ i.i.d. N(0,1)      {εt} ~ i.i.d. t5
                               CVR     FI                CVR     FI
BFI (fitted residuals)         0.937   3.890 (0.375)     0.978   5.188 (0.785)
BFI (predictive residuals)     0.947   4.060 (0.382)     0.983   5.416 (0.802)
10.11 Consider the river flow data set; see Examples 9.3 and 10.7. The file SCMI-HDR.dat
contains the last 35 observations of the river flow data set (column 1), the SCMI-lower
and upper FI (columns 2 – 3), and the HDR-lower and upper FI (columns 4 – 5).
Both FIs are shown in Figure 10.6, with coverage probability 1 − α = 0.9.
(a) Evaluate the two FIs using the test statistics $\mathrm{LR}_{uc}$, $\mathrm{LR}_{ind}$, and $\mathrm{LR}_{cc}$. In the case of $\mathrm{LR}_{uc}$ and $\mathrm{LR}_{cc}$, take p = [0.5, 0.525, . . . , 0.95] (19 values).
(b) Test for independence of the process $\{i_t^{(\alpha)}, t \in \mathbb{Z}\}$ using the test statistic (10.74).
Calculate the rejection frequency under the null hypothesis over 25,000 replic-
ations. Compare the outcome of the test with the test result of LR ind in part
(a).
10.12 Consider a certain strictly stationary and invertible time series process $\{Y_t, t \in \mathbb{Z}\}$ whose ACF is identically zero. Therefore, it is reasonable to use $Y^{LS}_{t+H|t} = E(Y_{t+H}|Y_s, -\infty < s \le t) = 0$ (H ≥ 1) as the best (in the MSE sense) least squares (LS) forecast of $Y_{t+H}$. Yet, assume that in reality $\{Y_t, t \in \mathbb{Z}\}$ is nonlinear. If this fact is known, the forecast accuracy may be improved using a nonlinear (NL) forecast based on a proper nonlinear model. This is a starting-point for the following forecast comparison.
Suppose that the time series $\{Y_t\}_{t=1}^{T}$ is generated by the subdiagonal BL model $Y_t = \psi Y_{t-2}\varepsilon_{t-1} + \varepsilon_t$, where $\{\varepsilon_t\} \stackrel{i.i.d.}{\sim} N(0, 1)$. The coefficient ψ is assumed to be known, and ψ satisfies the invertibility condition $|\psi| < 1/\sqrt{2}$. Of course, the assumption that the BL model is completely known is not very realistic in practice. However, under
not too restrictive assumptions, Matsuda and Huzii (1997) show that the LS and NL
predictors with LS estimated parameters converge to their asymptotic values.
(a) Using your favorite programming language, write a computer code to obtain estimates of the relative MSFE of the LS and NL forecasts for H = 1 and 2, and ψ = −0.65, −0.55, . . . , 0.65. That is,

$$\mathrm{MSFE}(Y^{LS}_{t+H|t})/\mathrm{MSFE}(Y^{NL}_{t+H|t}), \quad (H = 1, 2),$$

where the nonlinear two-step ahead forecasts are computed by the MC simulation method in Section 10.2.1. Set the number of replications N = 100. Moreover, set the number of MC replications at 2,000, and take T = 50.
In addition, consider the quadratic (Q) predictor introduced in (4.54). For the subdiagonal BL model the one-step ahead forecast is given by $Y^{Q}_{t+1|t} = \psi Y_{t-2}Y_{t-1}$. This approximation follows from replacing $\varepsilon_{t-1}$ by its definition $\varepsilon_{t-1} = Y_{t-1} - \psi Y_{t-3}\varepsilon_{t-2}$ and ignoring the term containing $\psi^2$ in the subsequent expression. The two-step ahead quadratic predictor can be obtained by MC simulation.
(b) Write a computer code to obtain estimates of the one- and two-step ahead MSFE of the quadratic predictor using the same MC setup as in part (a). Compute $\mathrm{MSFE}(Y^{Q}_{t+H|t})/\mathrm{MSFE}(Y^{NL}_{t+H|t})$ (H = 1, 2) for ψ = −0.65, −0.55, . . . , 0.65. Compare the estimates of the relative MSFEs with those obtained under (a).
Chapter 11
VECTOR PARAMETRIC MODELS AND
METHODS
In this chapter, we extend the univariate nonlinear parametric time series framework
to encompass multiple, related time series exhibiting nonlinear behavior. Over the
past few years, many multivariate (vector) nonlinear time series models have been
proposed. Some of them are “ad - hoc”, with a special application in mind. Others
are direct multivariate extensions of their univariate counterparts. Within the latter
class, a definition of a multivariate nonlinear time series model is often proposed with
the following objectives in mind. First, the definition should contain the most general
linear vector model as a special case when the nonlinear part is not present. This is
analogous to univariate nonlinear time series models embedding linear ones. Second,
the definition should contain the most general univariate nonlinear model within its
class of models. Also, a potential candidate for a multivariate nonlinear time series
model should possess some specified properties in order to permit estimation of the
unknown model parameters and allow statistical inference. Moreover, because one
of the main uses of time series analysis is forecasting, it is reasonable to restrict
consideration to models which are capable of producing forecasts.
In Section 11.1, we give a general parametric multivariate nonlinear model in
the context of a vector Volterra series expansion, extending the discussion in Sec-
tion 2.1.1. However, with this specification an enormous range of possible models
emerges. The obvious way to avoid this problem is to impose some sensible restric-
tions on the structure of the model. This has led to a wealth of “restricted” vector
nonlinear models. Our treatment in Section 11.2 covers only a few of the most basic
ones. Each subsection provides a definition of the model, and discusses conditions
for stationarity and invertibility, if available. In contrast, we will not say much about
estimating these vector nonlinear models. In most cases, QML and CLS estimation
methods may be employed. In Sections 11.3 and 11.4, we then discuss a number of
time-domain test statistics for nonlinearity. Most of these tests are generalizations
of similar tests discussed in Chapter 5. In Section 11.5, we briefly address the prob-
lem of choosing the proper structure of a model using two model selection criteria.
In practice, a truncated representation involving a finite number of parameters is
used to approximate this structure. In particular, the ith component of a vector BL
model results if all the coefficients of the second- and higher-order terms in (11.2)
equal zero. Furthermore, we introduce the m(p + q)-dimensional state vector St
defined by
$$S_t = (Y'_t,\ldots,Y'_{t-p+1},\,\varepsilon'_t,\ldots,\varepsilon'_{t-q+1})'. \qquad (11.3)$$

Then we can define a multivariate SDM of order (p, q) which is locally linear, just as in (2.10). Its ith component is given by

$$Y_{i,t} = \mu_i(S_{t-1}) + \sum_{j=1}^{p}\phi_{i,j}(S_{t-1})Y_{i,t-j} + \varepsilon_{i,t} + \sum_{\ell=1}^{q}\theta_{i,\ell}(S_{t-1})\varepsilon_{i,t-\ell}, \quad (i = 1,\ldots,m). \qquad (11.4)$$
If all the parameters are constant, we have the ith component of the well-known
vector autoregressive moving average (VARMA) model. Clearly, an obvious gener-
alization of (11.1) is to allow for exogenous regressors in the function g(·).
11.2 VECTOR MODELS
$$Y_{i,t} = \varepsilon_{i,t} + \sum_{u=1}^{m}\sum_{j=1}^{p}\phi^{j}_{i,u}Y_{u,t-j} + \sum_{u=1}^{m}\sum_{j=1}^{q}\theta^{j}_{i,u}\varepsilon_{u,t-j} + \sum_{k,\ell=1}^{m}\sum_{u=1}^{P}\sum_{v=1}^{Q}\psi^{uv}_{i,k,\ell}Y_{k,t-u}\varepsilon_{\ell,t-v}, \qquad (11.5)$$
By introducing matrix notation and the Kronecker product, we can write the system of equations defined by (11.5) in vector form as

$$Y_t = \sum_{j=1}^{p}\Phi^{j}Y_{t-j} + \varepsilon_t + \sum_{j=1}^{q}\Theta^{j}\varepsilon_{t-j} + \sum_{u=1}^{P}\sum_{v=1}^{Q}\Psi^{uv}\{\varepsilon_{t-v}\otimes Y_{t-u}\}, \qquad (11.6)$$

where

$$\Psi^{uv} = \big((\mathrm{vec}(\psi^{uv}_1))',\ldots,(\mathrm{vec}(\psi^{uv}_m))'\big)'.$$
Note that (11.5) involves $PQm^2 + m(p+q)$ parameters, making it too general to be of use in practice. As for the univariate BL model, special cases of (11.5) include the
• subdiagonal case: $\psi^{uv}_{i,k,\ell} = 0$, ∀u < v;
• diagonal case: $\psi^{uv}_{i,k,\ell} = 0$, ∀u ≠ v.
Stationarity
Stensholt and Tjøstheim (1987) give sufficient conditions for strict stationarity of
vector subdiagonal BL models, and obtain expressions for the mean and higher-
order autocovariance matrices.1 For simplicity, we assume that P = p and Q = q,
and q ≤ p. This is not an essential assumption, since it can be fulfilled by introducing
1
Our use of the term “subdiagonal” is in line with the definition given by Granger and Andersen
(1978a) and Stensholt and Tjøstheim (1987).
a suitable number of zero matrices. Now, we can rewrite (11.6) in a state space form. That is,

$$S_t = F\varepsilon_t + AS_{t-1} + \sum_{v=1}^{q} C_v[\varepsilon_{t-v}\otimes I_{m(p+q)}]S_{t-1}, \qquad (11.7)$$

where we define the $m(p+q)\times m$ matrix F and the $m(p+q)\times m(p+q)$ matrix A as follows:

$$F = \begin{pmatrix} I_m\\ 0_{m(p-1)\times m}\\ I_m\\ 0_{m(q-1)\times m} \end{pmatrix}, \qquad A = \begin{pmatrix} \Phi^1\ \cdots\ \Phi^p & \Theta^1\ \cdots\ \Theta^q\\ I_{m(p-1)}\quad 0_{m(p-1)\times m} & 0_{m(p-1)\times mq}\\ 0_{m\times mp} & 0_{m\times mq}\\ 0_{m(q-1)\times mp} & I_{m(q-1)}\quad 0_{m(q-1)\times m} \end{pmatrix}.$$

The matrices $C_v$ collect the BL coefficients, $C_v = \{\psi^{kv}_{i,u},\ 1\le u\le m,\ 1\le k\le p\}$, arranged conformably with $[\varepsilon_{t-v}\otimes I_{m(p+q)}]S_{t-1}$, and where for simplicity we assume that $\psi^{kv}_{i,u} = 0$ for k < v in the sequel.
Following Stensholt and Tjøstheim (1987), we shall use the above matrices to formulate a strictly stationary solution of (11.7). Let $H = E[\{\varepsilon_t\otimes I_{m(p+q)}\}\otimes\{\varepsilon_t\otimes I_{m(p+q)}\}]$. We further introduce the $m^2(p+q)^2\times m^2(p+q)^2$ matrices $\Gamma_v$ ($1\le v\le q$) defined by

$$\Gamma_1 = A\otimes A + (C_1\otimes C_1)H,$$
$$\Gamma_v = \sum_{i=1}^{v-1}\big\{(A^{v-i}C_i)\otimes C_v\big\}H(A^{i-1}\otimes A^{v-1}) + (C_v\otimes C_v)H(A^{v-1}\otimes A^{v-1}) + \sum_{i=1}^{v-1}\big\{C_v\otimes(A^{v-i}C_i)\big\}H(A^{v-1}\otimes A^{i-1}), \quad (2\le v\le q),$$

where $A^0 = I_{m(p+q)}$. Moreover, let L be the $qm^2(p+q)^2\times qm^2(p+q)^2$ matrix defined by

$$L = \begin{pmatrix} \Gamma_1 & \Gamma_2 & \cdots & \Gamma_q\\ I_{(q-1)m^2(p+q)^2} & & 0_{(q-1)m^2(p+q)^2\times m^2(p+q)^2} \end{pmatrix}.$$
Then, if

$$\rho(L) < 1, \qquad (11.8)$$

equation (11.7) has a unique strictly stationary and ergodic solution (Stensholt and Tjøstheim, 1987, Thm. 4.1) given by

$$S_t = F\varepsilon_t + \sum_{j=1}^{\infty}\Big\{\prod_{r=1}^{j}\Big(A + \sum_{v=1}^{q}C_v[\varepsilon_{t-v-r+1}\otimes I_{m(p+q)}]\Big)\Big\}F\varepsilon_{t-j}, \qquad (11.9)$$

where the expression on the right-hand side of (11.9) converges absolutely almost surely as well as in the mean for every fixed t in $\mathbb{Z}$.
Liu (1989b) derives a sufficient condition for the existence of a strictly stationary
solution of the general vector BL model (11.6). The condition has the same form as
(11.8) except that the order and entries of the matrix Γj follow from another state
space representation than (11.7) with fewer dimensions. By assuming {εt } is an
i.i.d. sequence satisfying E(εi,t )2Q < ∞ (i = 1, . . . , m) and E(εt ) = 0, the condition
for strict stationarity reduces to (11.8).
A potentially useful result (Stensholt and Tjøstheim, 1987) for the identification of vector superdiagonal BL models is that the autocovariance matrix of $\{S_t, t \in \mathbb{Z}\}$ at lag ℓ (ℓ > q) is given by

$$\mathrm{Cov}(S_t, S_{t-\ell}) = \sum_{i=1}^{p}A_i\,\mathrm{Cov}(S_{t-i}, S_{t-\ell}), \qquad (11.10)$$

assuming the existence of the fourth moments of $\{\varepsilon_t\}$. Thus, the process (11.9) has the same autocovariance structure as a VARMA(p, q) process. This result suggests that p and q, selected by standard linear model selection techniques such as AIC or BIC, can also serve as upper bounds on the lag orders P and Q in the specification of BL models.
Invertibility
Here, we discuss invertibility of the process $\{Y_t, t \in \mathbb{Z}\}$ given by (11.6) with P = p, Q = q and q ≤ p. Define the mp × 1 vectors

$$S_t = (Y'_{t-1},\ldots,Y'_{t-p})', \qquad U_t = (\varepsilon'_{t-1},\ldots,\varepsilon'_{t-q},\,0',\ldots,0')'.$$

Now the process satisfying (11.6) is invertible (both by the classical concept of invertibility and by the Granger–Andersen invertibility concept), if

$$\exp\Big(E\log\Big\|\Theta + \sum_{v=1}^{q}\Psi[S_{t-v}\otimes I_{mp}]\Big\|\Big) < 1. \qquad (11.12)$$

$$E\Big\|\Theta + \sum_{j=1}^{q}\Psi[Y_{t-j}\otimes I_{mp}]\Big\| < 1. \qquad (11.13)$$
It is clear that conditions (11.12) and (11.13) do not depend on the coefficients
of the linear VAR(p) submodel. Nevertheless, these conditions are hard to verify
in practice since they depend on the distribution of {Yt , t ∈ Z}. However, we can
replace (11.12) by a stronger condition which assumes only the existence of second
moments of {Yt , t ∈ Z}. As an example, we consider a multivariate BL model with
a single lag in the noise term and P = p. Then the representation of the multivariate BL model with just one lag in the noise term is given by

$$Y_t = \sum_{i=1}^{p}\Phi_i Y_{t-i} + \varepsilon_t + \Big(\Theta_v + \sum_{u=1}^{p}\Psi^{uv}[Y_{t-u}\otimes I_m]\Big)\varepsilon_{t-v}, \quad (v\in\{1,\ldots,q\};\ q\le p), \qquad (11.14)$$
where $\Psi^{uv}$ is the m × m² matrix with rows built from the coefficients $\psi^{uv}_{i,k,\ell}$, its first row starting

$$\big(\psi^{uv}_{1,1,1}\ \cdots\ \psi^{uv}_{m,1,1}\ \cdots\ \psi^{uv}_{m,1,m}\big).$$

The stronger invertibility condition, which assumes only the existence of second moments of $\{Y_t, t \in \mathbb{Z}\}$, then becomes

$$\Big\|\Theta_v + \sum_{u=1}^{p}\Psi^{uv}\Big\|\,E\|Y_t\|^2 < 1, \quad (v\in\{1,\ldots,q\};\ q\le p). \qquad (11.16)$$
where

$$F = \begin{pmatrix}0.5 & 0\\ 0 & -0.7\\ 0.5 & 0\\ 0 & -0.7\end{pmatrix}, \quad A = \begin{pmatrix}0.2 & 0.3 & 0 & 0\\ 0.1 & -0.5 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\end{pmatrix}, \quad C_1 = \begin{pmatrix}0.2 & -0.1 & 0 & 0 & 0.1 & 0.3 & 0 & 0\\ 0.4 & -0.3 & 0 & 0 & -0.3 & 0.4 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\end{pmatrix}.$$

The stationarity condition (11.8) becomes $\rho\{(A\otimes A) + (C_1\otimes C_1)H\} < 1$ with $H = E[\{\varepsilon_t\otimes I_4\}\otimes\{\varepsilon_t\otimes I_4\}]$, and by (11.16) the invertibility condition becomes $\|\Psi^{11}\|\,E\|Y_t\|^2 < 1$, where

$$\Psi^{11} = \begin{pmatrix}0.2 & -0.1 & 0.1 & 0.3\\ 0.4 & -0.3 & -0.3 & 0.4\end{pmatrix}.$$
We can obtain the stationarity condition by simple calculation. Since $E\|Y_t\|^2$ is unknown, we replace the expression for the invertibility condition by the approximation $\|\Psi^{11}\|\,(1{,}000)^{-1}\sum_{t=1}^{1{,}000}\|Y_t\|^2$.
When we fix the covariance matrix of the vector time series process $\{\varepsilon_t\}$ at $\Sigma_\varepsilon = I_2$, the value of the stationarity condition equals 0.57, and the values of the approximate invertibility condition are in the range (0.71, 1.19) with an average of 0.89. When $\Sigma_\varepsilon = \begin{pmatrix}2 & 0.5\\ 0.5 & 2\end{pmatrix}$ the value of the stationarity condition is 0.70. On the other hand, the values of the approximate invertibility condition are in the range (1.26, 1.62), indicating that the process is non-invertible.
Figures 11.1(a) – (b) show the pattern of a typical realization of {Yt , t ∈ Z},
for each covariance matrix Σε . Overall these time series are rather stable in
both cases, with larger changes in the variance of {Yt , t ∈ Z} in Figure 11.1(b)
than in Figure 11.1(a). In general, the stationarity condition (11.8) works well
for a wide range of parameter matrices. However, one has to be careful in
using condition (11.16) since it seems to be too strong, i.e. the invertibility
domain is smaller than the exact invertibility domain. We discussed this point
earlier in Section 3.5 for the univariate case.
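For completeness, the following R sketch evaluates the stationarity condition for the case $\Sigma_\varepsilon = I_2$, with the matrices A and C1 transcribed from the example above; the matrix H is formed exactly from the second moments of εt, and the reported spectral radius can be compared with the value 0.57 mentioned earlier.

```r
# Minimal sketch: checking stationarity condition (11.8) for the bivariate
# BL example, with A and C1 as given above and Sigma = I2.
I4 <- diag(4); E2 <- diag(2); Sigma <- diag(2)
A  <- rbind(c(0.2,  0.3, 0, 0),
            c(0.1, -0.5, 0, 0),
            c(0,    0,   0, 0),
            c(0,    0,   0, 0))
C1 <- rbind(c(0.2, -0.1, 0, 0,  0.1, 0.3, 0, 0),
            c(0.4, -0.3, 0, 0, -0.3, 0.4, 0, 0),
            rep(0, 8), rep(0, 8))
# H = E[{eps (x) I4} (x) {eps (x) I4}], built from the second moments of eps
H <- matrix(0, 64, 16)
for (i in 1:2) for (j in 1:2) {
  ei <- E2[, i, drop = FALSE]; ej <- E2[, j, drop = FALSE]
  H  <- H + Sigma[i, j] * kronecker(kronecker(kronecker(ei, I4), ej), I4)
}
Gamma1 <- kronecker(A, A) + kronecker(C1, C1) %*% H
max(Mod(eigen(Gamma1)$values))   # spectral radius; < 1 implies strict stationarity
```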
Figure 11.1: A typical realization of a bivariate BL process (T = 500; blue solid line $Y_{1,t}$, red solid line $Y_{2,t}$); (a) $\Sigma_\varepsilon = I_2$, and (b) $\Sigma_\varepsilon = \begin{pmatrix}2 & 0.5\\ 0.5 & 2\end{pmatrix}$.
where d > 0 is the threshold lag or delay parameter. Then, for an m-dimensional
strictly stationary time series process {Yt , t ∈ Z}, a VTARMA model of order
(k; p, . . . , p, q, . . . , q) is defined as

$$Y_t = \sum_{i=1}^{k}\Big(\Phi^{(i)}_0 + \sum_{u=1}^{p}\Phi^{(i)}_u Y_{t-u} + \varepsilon^{(i)}_t + \sum_{v=1}^{q}\Psi^{(i)}_v\varepsilon^{(i)}_{t-v}\Big)I(X_{t-d}\in R^{(i)}). \qquad (11.17)$$
Note that (11.17) is very general, in the sense that the regimes are defined by
arbitrary subspaces of Rm . However, identification of such regimes can be difficult in
practice. Tsay (1998) discusses a VTAR model in which values of a single exogenous
variable X1,t−d ≡ Xt−d are used to determine the different regimes. That is, with
R(i) = (ri−1 , ri ], where −∞ = r0 < r1 < · · · < rk−1 < rk = ∞, (11.17) simplifies to
$$Y_t = \sum_{i=1}^{k}\Big(\Phi^{(i)}_0 + \sum_{u=1}^{p}\Phi^{(i)}_u Y_{t-u} + \varepsilon^{(i)}_t\Big)I(X_{t-d}\in R^{(i)}) = \sum_{i=1}^{k}\Big(\Phi^{(i)}_0 + \sum_{u=1}^{p}\Phi^{(i)}_u Y_{t-u} + \varepsilon^{(i)}_t\Big)\big(I^{(i-1)}_{t-d} - I^{(i)}_{t-d}\big), \qquad (11.18)$$

where

$$\big(I^{(i-1)}_{t-d} - I^{(i)}_{t-d}\big) \equiv I(X_{t-d} > r_{i-1}) - I(X_{t-d} \ge r_i), \qquad (I^{(0)}_{t-d} = 1,\ I^{(k)}_{t-d} = 0).$$
Stationarity
To present stationarity conditions for the VTARMA process, we first define the m(p + q)-dimensional vector $U_t = (\varepsilon'_t, 0'_{m(p-1)\times 1}, \varepsilon'_t, 0'_{m(q-1)\times 1})'$. We set $\omega = (1,\ldots,1)'$ and $\Phi^{(i)}_0 = 0$, ∀i, in (11.17). We also need the state space vector $S_t$, which satisfies

$$S_t = \sum_{i=1}^{k}\big(\Phi^{(i)}S_{t-1} + U_t\big)I(X_{t-d}\in R^{(i)}), \qquad (11.19)$$

with

$$\Phi^{(i)} = \begin{pmatrix}\Phi^{(i)}_{11} & \Phi^{(i)}_{12}\\ 0 & \Phi^{(i)}_{22}\end{pmatrix}, \quad \Phi^{(i)}_{11} = \begin{pmatrix}\Phi^{(i)}_1\ \cdots\ \Phi^{(i)}_p\\ I_{m(p-1)}\quad 0_{m(p-1)\times m}\end{pmatrix}, \quad \Phi^{(i)}_{12} = \begin{pmatrix}\Psi^{(i)}_1\ \cdots\ \Psi^{(i)}_q\\ 0_{m(p-1)\times mq}\end{pmatrix}, \quad \Phi^{(i)}_{22} = \begin{pmatrix}0_{m\times m}\ \cdots\ 0_{m\times m}\\ I_{m(q-1)}\quad 0_{m(q-1)\times m}\end{pmatrix}.$$
Observe that (11.19) is identical to the SRE in (3.1) with $A_t \equiv \Phi^{(i)}$ if $X_{t-d}\in R^{(i)}$ (i = 1, . . . , k). After s iterations, and similar to (3.4), (11.19) can be written as

$$S_t = \Big(\prod_{i=0}^{s}A_{t-i}\Big)S_{t-s-1} + \sum_{i=0}^{s}\Big(\prod_{j=0}^{i-1}A_{t-j}\Big)U_{t-i}, \quad \forall s\in\mathbb{N}, \qquad (11.20)$$

where $\prod_{j=0}^{-1}A_{t-j}$ is the identity matrix. Then, under some mild conditions, Niglio and Vitale (2014) show that the process $\{S_t, t \in \mathbb{Z}\}$ is strictly stationary and ergodic if

$$\prod_{i=1}^{k}\big\{\rho(\Phi^{(i)})\big\}^{p_i} < 1, \qquad (11.21)$$

where $p_i = E[I(X_{t-d}\in R^{(i)})] < 1$ with $\sum_{i=1}^{k}p_i = 1$, and $\rho(\Phi^{(i)})$ is the dominant eigenvalue (or spectral radius) of $\Phi^{(i)}$ (i = 1, . . . , k).
Invertibility
Consider, as a special case of (11.17), the VTMA(k; q, . . . , q) model

$$Y_t = \sum_{i=1}^{k}\sum_{v=1}^{q}\Psi^{(i)}_v\varepsilon_{t-v}\,I(X_{t-d}\in R^{(i)}) + \varepsilon_t. \qquad (11.22)$$

In state space form, (11.22) can be written as

$$S_t = U_t + \sum_{i=1}^{k}\Psi^{(i)}U_{t-1}\,I(X_{t-d}\in R^{(i)}). \qquad (11.23)$$

Then, under some mild conditions, Niglio and Vitale (2013) show that (11.23) is globally invertible if

$$\prod_{i=1}^{k}\big\{\rho(\Psi^{(i)})\big\}^{p_i} < 1. \qquad (11.24)$$
Further, they show that under condition (11.24), the VTMA(k; q, . . . , q) model can be written as a VTAR of infinite order with conditionally time dependent parameters, i.e.

$$\varepsilon_t = Y_t + \sum_{j=1}^{\infty}\Pi_{j,t}Y_{t-j}, \qquad (11.25)$$

where

$$\Pi_{j,t} = -\sum_{i=1}^{q}\Pi_{j-i,t}\Big(\sum_{\ell=1}^{k}\Psi^{(\ell)}_i\,I(X_{t-d-(j-i)}\in R^{(\ell)})\Big),$$
Figure 11.2: (a) – (b) Regime-specific realizations of $\{Y^{(J)}_{j,t}\}_{t=1}^{100}$ (j = 1, 2; J = 1, 2) obtained from model (11.27) with T = 7,000; (c) – (d) Sample CCFs for $\{(Y^{(J)}_{1,t}, Y^{(J)}_{2,t})\}_{t=1}^{T_J}$ ($T_1$ = 3,539 and $T_2$ = 3,460) with 95% asymptotic confidence limits (blue medium dashed lines).
where

$$\Phi^{(1)}_1 = \begin{pmatrix}0.8 & 0.5\\ 0 & 0\end{pmatrix}, \quad \Phi^{(2)}_1 = \begin{pmatrix}0.5 & 0.3\\ 0 & 0\end{pmatrix}, \quad \{\varepsilon_t\}\stackrel{i.i.d.}{\sim}N(0, I_2).$$
Figures 11.2(a) – (b) show plots of the series $\{Y^{(J)}_{j,t}\}_{t=1}^{100}$ (J = 1, 2; j = 1, 2) for each regime, obtained as subseries from two typical regime-specific realizations of length $T_1$ = 3,539 and $T_2$ = 3,460 respectively. Both plots provide information on a possible feedback relationship from earlier values of $Y^{(J)}_{2,t}$ (black solid lines) to $Y^{(J)}_{1,t}$ (blue solid lines). The sample CCFs in Figures 11.2(c) – (d) support this observation with significant values at lag ℓ = −1, as indicated by the Bartlett 95% asymptotic confidence limits $\pm 1.96/\sqrt{T_J}$.2
Unfortunately, Bartlett’s confidence limits are no longer valid for nonlinear
DGPs as we have remarked earlier. Thus, it may be safer to follow another
route to detect whether one time series is leading another. In particular,
Granger’s causality concept, or rather its opposite, Granger non-causality,
may be used for this purpose. The concept is well known in the context of
VAR models; see Section 12.5 for a definition. Evidently, for (11.27) a Granger causality test based on the parameter restriction $\phi^{(J)}_{1;1,2} = 0$ ∀J is no longer sufficient due to across-regime interactions.
An immediate but approximate solution is to compute a Granger causality measure for each regime J separately; see Leistritz et al. (2006). Let $Y^{-u}_t = (Y_{1,t},\ldots,Y_{u-1,t},Y_{u+1,t},\ldots,Y_{m,t})'$, where the superscript −u denotes omission of the uth variable in $\mathbb{R}^m$, with corresponding restricted information set $R^{-u}_j$ obtained by omitting the uth coordinate (j = 1, . . . , m). Further, for each regime J (J = 1, . . . , kmax), let $e^{(J)}_{j,t+1}|Y_{t-d}$ denote the one-step ahead forecast error for $Y^{(J)}_{j,t+1}$ (j = 1, . . . , m) conditional on $Y_{t-d}\in R_j$ when the forecast is given by the conditional mean. In addition, let $\widetilde e^{(J)}_{j,t+1}|Y^{-u}_{t-d}$ denote the one-step ahead forecast error for $Y^{(J)}_{j,t+1}$ (j = 1, . . . , m) conditional on $Y^{-u}_{t-d}\in R^{-u}_j$ with similar properties. Then $Y^{(J)}_{u,t}$ does not Granger cause in variance $Y^{(J)}_{j,t}$ (j ≠ u), denoted by $Y^{(J)}_{u,t}\stackrel{V}{\nrightarrow}Y^{(J)}_{j,t}$, if and only if

$$\gamma^{(J)}_{u\to j} = \log\frac{E\{(\widetilde e^{(J)}_{j,t+1})^2\,|\,Y^{-u}_{t-d}\}}{E\{(e^{(J)}_{j,t+1})^2\,|\,Y_{t-d}\}} = 0. \qquad (11.29)$$
In practice, we replace $\gamma^{(J)}_{u\to j}$ by an estimate $\widehat\gamma^{(J)}_{u\to j}$ using consistent estimates of $E\{(e^{(J)}_{j,t+1})^2|Y_{t-d}\}$ and $E\{(\widetilde e^{(J)}_{j,t+1})^2|Y^{-u}_{t-d}\}$. Thus, if the series $Y^{(J)}_{u,t}$ does not improve the prediction of $Y^{(J)}_{j,t+1}$, $\widehat\gamma^{(J)}_{u\to j}$ will be close to zero. Any improvement in the prediction of $Y^{(J)}_{j,t+1}$ by the inclusion of $Y^{(J)}_{u,t}$ in the information set leads to an increase in $\widehat\gamma^{(J)}_{u\to j}$.
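A minimal R sketch of this idea is given below, assuming a bivariate series, a known regime classification, and regime-specific linear autoregressions of order one as approximations to the conditional means; all function and argument names are illustrative.

```r
# Minimal sketch of the regime-specific Granger causality measure (11.29),
# assuming Y is a T x 2 matrix and `reg` gives the regime (1, ..., k) of
# each time point, based on the threshold variable X_{t-d}.
gamma.measure <- function(Y, reg, J, j = 1, u = 2) {
  T   <- nrow(Y)
  idx <- which(reg[1:(T - 1)] == J)        # times t falling in regime J
  y   <- Y[idx + 1, j]                     # one-step ahead target
  e   <- resid(lm(y ~ Y[idx, ]))           # unrestricted information set
  e.u <- resid(lm(y ~ Y[idx, -u]))         # u-th component omitted
  log(mean(e.u^2) / mean(e^2))             # close to 0: no causality in variance
}
```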
In order to evaluate the performance of $\widehat\gamma^{(J)}_{u\to j}$ in the case of the two-regime VSETAR(2; 1, 1) model (11.27) we bootstrapped the EDF of $\widehat\gamma^{(J)}_{u\to j}$ (100 BS replicates) and computed 95% critical values for each regime. Next, based on 500
2
If $\{X_t\}_{t=1}^{T}$ and $\{Y_t\}_{t=1}^{T}$ are two time series normalized to have zero mean and unit variance, their lag ℓ sample CCF is given by $c_{XY}(\ell) = (T-\ell)^{-1}\sum_{t=1}^{T-\ell}X_{t+\ell}Y_t$ (ℓ = 0, 1, 2, . . .).
where $\Delta Y_t \equiv Y_t - Y_{t-1}$ is I(0), a and α are both m × 1 parameter vectors, $A_i$ are m × m matrices of coefficients, and $\{\varepsilon_t\}$ is a sequence of i.i.d. random variables with mean zero and positive definite covariance matrix $\Sigma_\varepsilon$, independent of $Y_t$. If the time series are not cointegrated, then a VAR in $\Delta Y_t$ with p − 1 lags is appropriate. The partition of the matrix $\alpha\beta'$ in (11.30) is not unique; a convenient normalization condition is to set one element of β equal to unity.
Threshold
Assume $\big(\sum_{u=1}^{p}\Phi^{(i)}_u - I_m\big)$ has rank m − 1. Then, after rearranging some terms, we can write (11.17) as a k-regime threshold vector error correction (TVEC) model:3
3
The acronym TVEC is commonly used in the literature. Adopting the short-hand notation
VTEC would have been more in line with abbreviations introduced in Sections 11.2.2 and 11.2.3.
$$\Delta Y_t = \sum_{i=1}^{k}\Big(\phi^{(i)}_0 + \alpha^{(i)}(\beta^{(i)})'Y_{t-1} + \sum_{u=1}^{p-1}A^{(i)}_u\Delta Y_{t-u} + \varepsilon^{(i)}_t\Big)I(\beta'Y_{t-1}\in R^{(i)}), \qquad (11.31)$$

where $\sum_{u=1}^{p}\Phi^{(i)}_u - I_m = \alpha^{(i)}(\beta^{(i)})'$, with $\alpha^{(i)}$ and $\beta^{(i)}$ m × 1 vectors, and $A^{(i)}_u = -\sum_{j=u+1}^{p}\Phi^{(i)}_j$. Note that the delay d is set at one with $X_{t-1}\equiv Y_{t-1}$. Making the
delay a part of the set of unknown parameters is in principle possible, but would
make the estimation and identification process much more involved.
Model (11.31) implies that there exists a regime-specific stationary equilibrium
solution. To achieve identification of (11.31), some normalization must be imposed
on β and $\beta^{(i)}$ (i = 1, . . . , k). In the bivariate case, we recommend doing this by setting one element of these vectors equal to one. Note that if in regime i $(\beta^{(i)})'Y_{t-1}$ is I(0), then the threshold variable $\beta'Y_{t-1}$ will not be stationary when $\beta^{(i)}\ne\beta$.
Estimation of TVEC models can be performed by recursive CLS, assuming that
the order of the model and the value of the threshold cointegration parameters
are known. Another way to proceed is by adopting a QML procedure. Third,
two-stage LS can be used in a conditional way; see De Gooijer and Vidiella–i–
Anguera (2005) for a finite-sample comparison of these estimation procedures. El-
Shagi (2011) compares various genetic algorithms to optimize the likelihood function
of TVEC models. It is beyond the scope of this book to discuss these and other
estimation methods in detail.
$$Y_t = \sum_{i=1}^{k}\big(G^{(i-1)}_t - G^{(i)}_t\big)\Big(\Phi^{(i)}_0 + \sum_{u=1}^{p}\Phi^{(i)}_u Y_{t-u}\Big) + \varepsilon_t = \sum_{i=1}^{k}\big(G^{(i-1)}_t - G^{(i)}_t\big)(\Phi^{(i)})'Z_t + \varepsilon_t, \qquad (11.32)$$

where $Z_t = (1, Y'_{t-1},\ldots,Y'_{t-p})'$, $G^{(0)}_t \equiv I_m$, $G^{(k)}_t \equiv 0$,

$$\Phi^{(i)} = \big((\Phi^{(i)}_0)', (\Phi^{(i)}_1)',\ldots,(\Phi^{(i)}_p)'\big)',$$

and where $G^{(i)}_t \equiv G(X^{(i)}_t;\gamma^{(i)},c^{(i)})$ is an m × m diagonal matrix of transition functions

$$G^{(i)}_t = \mathrm{diag}\big\{G(X^{(i)}_{1,t};\gamma^{(i)}_1,c^{(i)}_1),\ldots,G(X^{(i)}_{m,t};\gamma^{(i)}_m,c^{(i)}_m)\big\}, \quad (i = 1,\ldots,k-1), \qquad (11.33)$$
with $c^{(i)} = (c^{(i)}_1,\ldots,c^{(i)}_m)'$ the location parameters, and $\gamma^{(i)}_j > 0$, ∀i, j. The sequence $\{\varepsilon_t\}$ is an m-dimensional vector WN process with mean zero and m × m positive definite covariance matrix $\Sigma_\varepsilon$, independent of $Y_t$. The transition variable $X^{(i)}_t = (X^{(i)}_{1,t},\ldots,X^{(i)}_{m,t})'$
(i = 1, . . . , k − 1) can take many forms, for example a lagged variable of one of the
components of {Yt , t ∈ Z}, a linear combination of the m series, a weakly stationary
exogenous variable, or a deterministic time trend.
When k = 2, (11.32) becomes

$$Y_t = \big[(I_m - G_t)(\Phi^{(1)})' + G_t(\Phi^{(2)})'\big]Z_t + \varepsilon_t = \Phi^{(1)}_0 + \sum_{u=1}^{p}\Phi^{(1)}_u Y_{t-u} + G(X_t;\gamma,c)\Big(\widetilde\Phi_0 + \sum_{u=1}^{p}\widetilde\Phi_u Y_{t-u}\Big) + \varepsilon_t, \qquad (11.34)$$
Stationarity
The transition functions $G^{(i)}_t$ are continuous and bounded between 0 and 1 for all values of $X^{(i)}_t$ (i = 1, . . . , k − 1). This implies that the VSTAR model has the same stability condition as the linear VAR model. Unfortunately, explicit necessary
same stability condition as the linear VAR model. Unfortunately, explicit necessary
and sufficient conditions for weak stationarity of LVSTAR models are not available
yet. Nevertheless, a “rough-and-ready” check for stationarity of nonlinear models
in general is to determine whether the skeleton is stable, using MC simulation. If
the skeleton is such that the observed vector time series tends to explode for certain
initial values, the process is likely to be nonstationary.
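The following R sketch illustrates this check for a bivariate two-regime LVSTAR(1) model of the form (11.34), iterating the noise-free skeleton from a number of random initial values; the parameter matrices and the divergence threshold are illustrative.

```r
# Minimal sketch of the skeleton stability check for a bivariate two-regime
# LVSTAR(1) model; all parameter values are illustrative.
G <- function(x, gamma = 2, c = 0) 1 / (1 + exp(-gamma * (x - c)))
Phi1  <- rbind(c(0.5, 0.1), c(0.0, 0.4))     # regime-1 AR matrix
Phi1t <- rbind(c(-0.6, 0.2), c(0.1, -0.3))   # additive regime-2 part
skeleton.stable <- function(n.init = 20, n.iter = 500) {
  for (s in 1:n.init) {
    y <- rnorm(2, sd = 5)                    # random initial value
    for (t in 1:n.iter) {
      y <- Phi1 %*% y + G(y[1]) * (Phi1t %*% y)   # transition variable: Y_{1,t-1}
      if (any(!is.finite(y)) || sum(abs(y)) > 1e8) return(FALSE)
    }
  }
  TRUE   # no explosive path found: skeleton appears stable
}
skeleton.stable()
```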
where ΔYt is I(0), the m × 1 vectors α(i) and β (i) (i = 1, 2) are as in (11.30),
and G(·) is an m × m diagonal matrix defined in (11.33). One way of keeping the
computational aspects tractable, is to assume that the transition variables as well
as the transition functions in (11.35) are the same for each model equation. In that
case, G(Xt ; γ, c) = G(Xt ; γ, c)Im .
Stationarity
Saikkonen (2005, 2008) considers conditions for stationarity and ergodicity of a gen-
eral three-regime nonlinear error correction model that encompasses the VSTEC
model. The m-dimensional process {Yt , t ∈ Z} is transformed to a process {Zt , t ∈
Z} which can be viewed as a Markov chain. The Markov chain Zt is geometrically er-
godic when the joint spectral radius of a (finite) set A ⊂ Rmp×mp of square matrices
is less than one. The set A consists of companion matrices defined through the
transformed representation of Yt . If A only contains a single matrix then the joint
spectral radius ρ(A) (see (B.7) for its definition) coincides with the spectral radius of
a square matrix. Clearly, the condition ρ(A) < 1 is hard to verify analytically. An
alternative method is to use one of the many algorithms for approximating the joint
spectral radius; see Chang and Blondel (2013) for an overview and a comparison of
these algorithms.
Figure 11.3: (a) Three nonstationary, I(1), time series $\{Y^{*}_{1,t}\}$, $\{Y_{2,t}\}$ and $\{Y_{3,t}\}$ of length T = 200; (b) A stationary nonlinear combination of the time series $\{Y_{2,t}\}$ and $\{Y_{3,t}\}$ in plot (a).
the individual component series. 4 These features may, for instance, be a common
stochastic trend as with linear cointegration; see the brief exposition in the preamble
of Section 11.2.4. A less restrictive specification arises under the assumption that the
cointegrating vector β = (β1 , . . . , βm ) is not a constant; e.g. β depends on time t, or
β is assumed to be a vector of random variables. This prompted Li and He (2012a)
to propose the following definition. The vector time series process {Yt , t ∈ Z} is
said to contain smooth transition cointegration if there exists an m × 1 time-varying
vector $\beta_t = (\beta_{1,t},\ldots,\beta_{m,t})'$ such that the nonlinear combination of $Y_t$ is I(0), that is

$$\beta_t'Y_t \sim I(0), \qquad (11.36)$$

where $\beta_{i,t} = \beta_i G(X_t;\gamma,c)$, and $G(\cdot)$ is a logistic transition function given by

$$G(X_t;\gamma,c) = \frac{1}{1+\exp\{-\gamma\prod_{j=1}^{q}(X_{j,t}-c_j)\}}, \qquad (11.37)$$
$$Y_{1,t} = \beta_{2,t}Y_{2,t} + \beta_{3,t}Y_{3,t} + \varepsilon_{1,t}, \qquad Y_{2,t} = Y_{2,t-1} + \varepsilon_{2,t}, \qquad Y_{3,t} = Y_{3,t-1} + \varepsilon_{3,t},$$

where $\beta_{2,t} = -0.8\big(1+\exp\{-2(X_t-0.3)\}\big)^{-1}$, $\beta_{3,t} = \big(1+\exp\{-(X_t-1)\}\big)^{-1}$, $\{\varepsilon_t = (\varepsilon_{1,t},\varepsilon_{2,t},\varepsilon_{3,t})'\}\stackrel{i.i.d.}{\sim}N(0, I_3)$, and $\{X_t\}\stackrel{i.i.d.}{\sim}N(0, 1)$. Both $\{Y_{2,t}, t \in \mathbb{Z}\}$
4
A feature that is present in each group of individual time series is said to be common to those
series if there exists a non-zero linear combination of the series that does not have the feature; Engle
and Kozicki (1993).
and $\{Y_{3,t}, t \in \mathbb{Z}\}$ are random walks, or I(1) processes. Their linear combination $Y^{*}_{1,t} = Y_{2,t} + Y_{3,t}$ is also a nonstationary, or I(1), process. Figure 11.3(a) shows plots of the three series $Y^{*}_{1,t}$, $Y_{2,t}$ and $Y_{3,t}$ over a sample period of length T = 200. Figure 11.3(b) shows a plot of the stationary nonlinear combination $\{Y_{1,t}\}_{t=1}^{200}$ with $\beta_t = (\beta_{2,t},\beta_{3,t})'$ the time-dependent cointegration vector.
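The example is easily reproduced by simulation; the following R sketch generates the three series from the equations above (the random seed and the plotting commands are illustrative).

```r
# Minimal sketch reproducing the smooth transition cointegration example:
# two random walks and a time-varying cointegrating vector beta_t.
set.seed(123)
T  <- 200
X  <- rnorm(T)                       # transition variable
e  <- matrix(rnorm(3 * T), T, 3)
Y2 <- cumsum(e[, 2]); Y3 <- cumsum(e[, 3])
b2 <- -0.8 / (1 + exp(-2 * (X - 0.3)))
b3 <-  1   / (1 + exp(-(X - 1)))
Y1 <- b2 * Y2 + b3 * Y3 + e[, 1]     # stationary nonlinear combination
Y1star <- Y2 + Y3                    # a linear combination stays I(1)
matplot(cbind(Y1star, Y2, Y3), type = "l", ylab = "")   # cf. Figure 11.3(a)
plot(Y1, type = "l")                                     # cf. Figure 11.3(b)
```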
$$A'g(Z_t;\theta) = 0. \qquad (11.39)$$

Clearly, (11.39) implies that $g^{*}(\cdot) = -(A^{**})'g^{**}(\cdot)$, an r × 1 vector. Moreover, it implies the following relation:

$$g(Z_t;\theta) = \begin{pmatrix}-(A^{**})'\\ I_{m-r}\end{pmatrix}g^{**}(Z_t;\theta).$$
Figure 11.4: (a) Two stationary nonlinear time series with a single CNF; (b) A stationary
linear combination of the time series in plot (a).
(iii) Regress εt from step (i) on Wt from step (ii). Compute the corresponding
m × m sum of squared regression matrix, SSR, and the sum of squared error
matrix, SSE.
where
(iii) Regress εt from step (i) on the vector of residuals Ut from step (ii). Compute
the corresponding m × m sum of squared regressions matrix, SSR2 , and the
sum of squared errors matrix, SSE2 . Let SSR2|1 = SSR2 − SSR1 , i.e.
SSR2|1 is the extra sum of squares due to the addition of the second-order
terms to the model.
Under $H_0$,

$$F^{(T)}_{T,p}(m) \xrightarrow{D} F_{\nu_1,\nu_2}, \quad\text{as } T\to\infty, \qquad (11.48)$$

with $\nu_1 = m$ and $\nu_2 = T - p - mp - m$. It may be noted that for m = 1, the degrees of freedom $\nu_1$ and $\nu_2$ are nearly the same as those reported in Algorithm 5.7; recall that $E(Y_t) = 0$ while in Algorithm 5.7 the univariate second-order Volterra expansion has a non-zero mean. Clearly, computation of (11.46) requires fewer degrees of freedom; i.e. the response variable is an m-variate vector as compared to an $\nu_U = mp(mp+1)/2$-variate vector in Algorithm 11.1. This may be preferable for short series.
Original F test
The multivariate generalization of the $F^{(O)}_T$ test statistic (Algorithm 5.8) employs disaggregated variables in step (ii) of Algorithm 11.2. The test statistic is based on the following model:
$$Y_t = \sum_{j=1}^{p}\Phi_j Y_{t-j} + \Psi\,\mathrm{vech}(Z_t\otimes Z'_t) + \varepsilon_t, \qquad (11.49)$$

where $Z_t = (Y'_{t-1},\ldots,Y'_{t-p})'$ is an mp × 1 vector, and Ψ is an m × mp(mp + 1)/2
parameter matrix. Thus, the null hypothesis of interest is given by H0 : Ψ = 0. The
computation of the corresponding test statistic goes as follows.
Algorithm 11.3: $F^{(O)}_T$ test statistic for nonlinearity
(i) Follow step (i) of Algorithm 11.2.
(iii) Regress εt from step (i) on Wt from step (ii). Compute the m × m sum
of squared regressions matrix, SSR2 , and the sum of squared errors matrix,
SSE2 . Let SSR2|1 = SSR2 − SSR1 .
$$F^{(O)}_{T,p}(m) = \frac{1-\Lambda(W)^{1/2}}{\Lambda(W)^{1/2}}\cdot\frac{T-p-\frac{1}{2}mp(mp+3)}{\nu_U}, \qquad (11.50)$$

where

$$\Lambda(W) = \frac{|\mathbf{SSE}_2|}{|\mathbf{SSE}_2 + \mathbf{SSR}_{2|1}|}. \qquad (11.51)$$

Under $H_0$,

$$F^{(O)}_{T,p}(m) \xrightarrow{D} F_{\nu_1,\nu_2}, \quad\text{as } T\to\infty, \qquad (11.52)$$

with $\nu_1 = \nu_U$ and $\nu_2 = T - p - \frac{1}{2}mp(mp+3)$.
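A compact R sketch of Algorithm 11.3, using the Wilks' lambda form of the test statistic and the F approximation (11.50), is given below; the function name is illustrative, and no adjustment is made for possible rank deficiency of the quadratic regressor matrix in short series.

```r
# Minimal sketch of the multivariate "original" F test (Algorithm 11.3):
# regress VAR(p) residuals on vech(Z_t Z_t') and compare error SSCP matrices.
tsay.F <- function(Y, p) {
  n <- nrow(Y); m <- ncol(Y)
  Z  <- embed(Y, p + 1)                # [Y_t, Y_{t-1}, ..., Y_{t-p}]
  y  <- Z[, 1:m]; X1 <- Z[, -(1:m)]    # response and mp lagged regressors
  E  <- resid(lm(y ~ X1))              # step (i): VAR(p) residuals
  # step (ii): quadratic terms vech(Z_t Z_t'), mp(mp+1)/2 columns
  idx <- which(lower.tri(diag(m * p), diag = TRUE), arr.ind = TRUE)
  W   <- X1[, idx[, 1]] * X1[, idx[, 2]]
  E2  <- resid(lm(E ~ X1 + W))         # step (iii)
  SSE1 <- crossprod(E); SSE2 <- crossprod(E2)
  Lambda <- det(SSE2) / det(SSE1)      # Wilks' lambda, |SSE2|/|SSE2 + SSR|
  nU  <- m * p * (m * p + 1) / 2
  df2 <- n - p - m * p * (m * p + 3) / 2
  Fstat <- (1 - sqrt(Lambda)) / sqrt(Lambda) * df2 / nU
  c(Wilks = Lambda, F = Fstat, p.value = 1 - pf(Fstat, nU, df2))
}
```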
Figure 11.5: Annual temperatures (Y1,t ) and tree ring widths (Y2,t ) for the years 1907 –
1972 (T = 66) at Campito Mountain, California.
Harvill and Ray (1999) also consider a semi-multivariate version of the test stat-
istics in Algorithm 11.3 in which each component of the vector series is regressed in-
dividually on Ut in step (ii). The individual test statistics for this semi-multivariate
version have a simple F distribution under the null hypothesis of linearity with
ν1 = mp(mp + 1)/2 and ν2 = T − mp(mp + 3)/2 degrees of freedom. In this case,
however, possible cross-correlation in the error terms is not accounted for by the
procedure. On the other hand, the semi-multivariate test may be more powerful
when only one of the component series of {Yt , t ∈ Z} is nonlinear.
The Wilks' Λ(W) test statistics in Algorithms 11.1 – 11.3 are formulated as LR-type tests. Other test statistics can be defined directly in terms of the sum of squared errors matrix SSE and the sum of squared regression matrix SSR, or in terms of their non-zero eigenvalues; see, e.g., Johnson and Wichern (2002, Chapter 7). Two well known multivariate test statistics are the Hotelling–Lawley (HL) trace test statistic and Pillai's (P) trace test statistic, respectively defined by:

$$\mathrm{HL} = \mathrm{tr}\{\mathbf{SSR}\,\mathbf{SSE}^{-1}\}, \qquad (11.53)$$
$$\mathrm{P} = \mathrm{tr}\{\mathbf{SSR}(\mathbf{SSR}+\mathbf{SSE})^{-1}\}. \qquad (11.54)$$

The test statistic (11.53) is valid when SSE is positive definite. The test statistic (11.54) requires a less restrictive assumption: SSR + SSE is positive definite. Wilks' lambda and the Hotelling–Lawley trace test statistics are nearly equivalent for large sample sizes.
The test statistic (11.50) can be extended to include cubic terms, as in the augmented $F^{(A)}_T$ test statistic of Section 5.4. However, the proliferation of additional terms in the multivariate case is expected to result in a loss of power due to fewer degrees of freedom for the F test statistic, unless m is small and T is large. Also, as in the univariate case, a VARMA(p, q) model can be fit to the data initially (using, e.g., QML estimation) to allow for linear MA structure. In that case the test statistic
(11.50) is modified by letting $Z_t$ in step (i) be $(Y'_{t-1},\ldots,Y'_{t-p},\widehat\varepsilon'_{t-1},\ldots,\widehat\varepsilon'_{t-q})'$, where $\widehat\varepsilon_t$ denotes the series of residuals from the VARMA fit.
Table 11.1: Values of the multivariate nonlinearity test statistics for the annual temper-
atures (Y1,t ) and tree ring widths (Y2,t ) time series; T = 66, p = 4, and m = 2.
Test            Wilks   HL-Trace  P-Trace   Num.  Den.   p-value(Wilks)  p-value(HL)  p-value(P)
F(O)T,p(m)      0.042   8.888     1.539     36    18     0.028           0.023        0.047
F(T)T,p(m)      0.572   0.747     0.429     2     52     0.000           0.000        0.000
F(O)T (Y1,t)    3.191                       10    52     0.003
F(O)T (Y2,t)    3.146                       10    52     0.003
Semi (Y1,t)     2.691                       36    22     0.008
Semi (Y2,t)     1.358                       36    22     0.227
The rings of trees at certain sites of western North America provide a unique source of information on past variations of the climatic and other environmental factors which prevail over North America and the adjoining oceans. Figure 11.5 shows plots of annual temperatures (in ◦F) and annual tree ring widths (in 0.01 mm) measured at Campito Mountain in California for the years 1907 – 1972 (T = 66). Below, we use this data set as an illustration of the nonlinearity test statistics discussed above.
The sample ACF and PACF matrices both identify an association between
tree ring widths in year t (Y2,t ) and tree ring widths one, three, and four years
back, while changes in temperature (Y1,t ) are associated with the previous
year’s tree growth; cf. Exercise 12.2. So, as a first step, we fitted a VAR(4)
model to the data. Next, we computed the test statistics in Algorithms 11.2
and 11.3 using appropriate versions of Wilks’ lambda statistic, the HL test
statistic, and the P test statistic. In addition, based on the Wilks' lambda statistic, we applied the semi-multivariate version of the F^(O)_{T,p}(m) test statistic and its univariate analogue, F^(O)_T (Algorithm 5.8).
Table 11.1 contains the values of the test statistics, p-values, and degrees of freedom. The p-values for the multivariate nonlinearity test statistics F^(O)_{T,p}(m) and F^(T)_{T,p}(m), for the Wilks' lambda statistic, the HL test statistic, and the P test statistic, all indicate that the null hypothesis of linearity should be rejected at the 5% nominal significance level. The same conclusion emerges for each series from the p-values of the F^(O)_T test statistic based on the Wilks' lambda test statistic. On the other hand, the p-value of the semi-multivariate version of Tsay's original test statistic does not reject linearity for the tree ring widths Y2,t. However, as stated above, the semi-multivariate test statistics do not account for the sample cross-correlations between the time series {Y1,t} and {Y2,t}, which are significant at the 5% nominal level.
  ε̂_{τ_{s+1}+d} = Y_{τ_{s+1}+d} − Φ̂′_s Z_{τ_{s+1}+d},

with ê_{τ_{s+1}+d} the corresponding standardized predictive residuals.

(iii) Regress ê_{τ_ℓ+d} on Z_{τ_ℓ+d} (ℓ = nmin + 1, . . . , T − h), and let ω̂_{τ_ℓ+d} denote the residuals of this regression.

(iv) Compute the test statistic

  CT,p(d, m) = [T − h − nmin − (mp + 1)]{log |SSE0| − log |SSE1|},   (11.57)

where

  SSE0 = (1/T*) Σ_{ℓ=nmin+1}^{T−h} ê_{τ_ℓ+d} ê′_{τ_ℓ+d},   SSE1 = (1/T*) Σ_{ℓ=nmin+1}^{T−h} ω̂_{τ_ℓ+d} ω̂′_{τ_ℓ+d},

with T* = T − h − nmin.

(v) Under the null hypothesis that {Yt, t ∈ Z} follows a strictly stationary VAR(p) process, and some regularity conditions, Tsay (1998) shows that

  CT,p(d, m) →^D χ²_{m(mp+1)},  as T → ∞.   (11.58)
The test statistic has good power when the delay d is correctly specified (Tsay, 1998). The power deteriorates when the delay used in the test differs from the actual delay. Note that H0 includes a zero intercept for all predictive residuals. In theory, a non-zero intercept signifies a systematic bias in the estimation of (11.56), indicating possible change points. So, due to the possibility of finite-sample bias, one may wish to exclude the intercept term from the nonlinearity test statistic (11.57), which can be achieved by mean-correcting SSE0. In this case, the resulting test statistic has an asymptotic χ²_{m²p} distribution under the null hypothesis.
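A rough R sketch of the resulting test is given below. It emulates the recursive LS step with expanding-window regressions (transparent but slow), and assumes the threshold variable is the d-lagged first component of the series; the function name and the default nmin = 3(mp + 1) are our own choices.

  ## Rough sketch of Tsay's C_{T,p}(d, m) threshold test (Algorithm 11.4).
  tsay_threshold_mv <- function(y, p, d, nmin = NULL) {
    Tn <- nrow(y); m <- ncol(y); h <- max(p, d); idx <- (h + 1):Tn
    Z  <- cbind(1, do.call(cbind, lapply(1:p, function(j) y[idx - j, , drop = FALSE])))
    Yd <- y[idx, , drop = FALSE]
    ord <- order(y[idx - d, 1])              # arrange by the threshold variable
    Z <- Z[ord, , drop = FALSE]; Yd <- Yd[ord, , drop = FALSE]
    n <- length(idx); k <- ncol(Z)           # k = mp + 1
    if (is.null(nmin)) nmin <- 3 * k
    E <- matrix(NA_real_, n - nmin, m)
    for (s in (nmin + 1):n) {                # standardized predictive residuals
      Zs  <- Z[1:(s - 1), , drop = FALSE]
      fit <- lm.fit(Zs, Yd[1:(s - 1), , drop = FALSE])
      z   <- Z[s, ]
      lev <- 1 + drop(t(z) %*% chol2inv(qr.R(qr(Zs))) %*% z)
      E[s - nmin, ] <- (Yd[s, ] - z %*% fit$coefficients) / sqrt(lev)
    }
    Zr <- Z[(nmin + 1):n, , drop = FALSE]
    SSE0 <- crossprod(E) / nrow(E)
    SSE1 <- crossprod(resid(lm(E ~ Zr - 1))) / nrow(E)
    Cstat <- (n - nmin - k) * (log(det(SSE0)) - log(det(SSE1)))   # (11.57)
    c(C = Cstat, p.value = pchisq(Cstat, df = m * k, lower.tail = FALSE))
  }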
  Yt = Φ0 + Σ_{i=1}^{p} Φi Yt−i + (Ψ0 + Σ_{i=1}^{p} Ψi Yt−i) I(Xt−d ≤ r) + εt,   (11.59)

with F_t the information set, and Σε a positive definite matrix. It is also assumed that p and d are unknown, and that r belongs to a known bounded subset R = [r̲, r̄] of R.
  H0: Ψ^(U)_v = 0,  vs.  H1: Ψ^(U)_v ≠ 0, for some r ∈ R.   (11.63)
with η̂v = Yv − (Im ⊗ X)Φ̂^(R)_v a vector of residuals. Similarly, given (11.61), the CLS estimates of the unrestricted parameter vectors Φ^(U)_v and Ψ^(U)_v, and the corresponding estimate of Σε, are respectively given by

  Φ̂^(U)_v = {Im ⊗ (X′X)^{-1}X′[I_{T−p} − Yr G^{-1} Y′r (I_{T−p} − PX)]} Yv,
  Ψ̂^(U)_v = {Im ⊗ G^{-1} Y′r (I_{T−p} − PX)} Yv,   and   Σ̂ε = ε̂′ε̂/(T − p),

where PX = X(X′X)^{-1}X′, G = Y′r(I_{T−p} − PX)Yr, and ε̂v = Yv − (Im ⊗ X)Φ̂^(U)_v − (Im ⊗ Yr)Ψ̂^(U)_v.
Now, let Λ̂η = (Σ̂ε^{-1/2} ⊗ I_{T−p})η̂v and Λ̂ε = (Σ̂ε^{-1/2} ⊗ I_{T−p})ε̂v be the rescaled residual vectors. Then the LR statistic for testing H0 against H1 is defined in terms of the residual sum of squares matrices as

  LRT,p(m, r) = sup_{r∈R} {Λ̂′η Λ̂η − Λ̂′ε Λ̂ε}
              = sup_{r∈R} {[Σ̂ε^{-1/2} ⊗ Y′r(I_{T−p} − PX)]Yv}′ [Im ⊗ G^{-1}] {[Σ̂ε^{-1/2} ⊗ Y′r(I_{T−p} − PX)]Yv},   (11.65)

where the second expression on the right-hand side follows from some simple algebra. Note that for a fixed r and m = 1, (11.65) reduces to the LR test statistic LR^(8)_T defined by (5.55).
The asymptotic null distribution follows in a similar way as described in Section 5.2. Suppose the following assumption holds:

  (1/T) [ X′X  X′Yr ; Y′rX  Y′rYr ] →^{a.s.} [ Σ  Σ12(r) ; Σ21(r)  Σ22(r) ],  as T → ∞,

where Σ(·), Σ21(·) = Σ12(·)′, and Σ22(·) are (mp + 1) × (mp + 1) matrices. Under H0, standard regularity conditions, and as T → ∞, it can be shown (Liu, 2011) that

  LRT,p(m, r) →^D sup_{r∈R} G′_{2(mp+1)}(r) Ω(r) G_{2(mp+1)}(r),   (11.66)

where

  Ω(r) = Im ⊗ {Σ22(r) − Σ21(r)Σ^{-1}Σ12(r)}^{-1},

and {G_{2(mp+1)}(r)} is a Gaussian process with mean zero and covariance kernel Im ⊗ (Σ22(r ∧ s) − Σ21(r)Σ^{-1}Σ12(s)). Then, for large α, and using the Poisson clumping heuristic method, we have

  P( sup_{r∈R} G′_{2(mp+1)}(r) Ω(r) G_{2(mp+1)}(r) ≤ α ) ∼ exp{ −2χ²_{m(mp+1)}(α) [α/(mp + 1) − 1] Σ_{i=1}^{mp+1} (ti(r̄) − ti(r̲)) },   (11.67)

where χ²_{m(mp+1)}(·) denotes the pdf of the χ² distribution with m(mp + 1) degrees of freedom, ti(r) = ½ log{Li(r)/(1 − Li(r))} ∀i, and the Li(r) are the eigenvalues of Σ22^{-1/2}(r) Σ21(r) Σ^{-1} Σ12(r) Σ22^{-1/2}(r). Appendix 11.A contains a table with selected percentiles of the LR-VATR test statistic when m = 2.
Also, let θ denote the vector of all available parameters. As {εt} ~ i.i.d. Nm(0, Σε), the conditional log-likelihood function is, apart from a constant, given by

  log LT(θ) = −(1/2) Σ_{t=1}^{T} (Yt − Ψt B′Zt)′ Σε^{-1} (Yt − Ψt B′Zt),
where PX = X(X′X)^{-1}X′, and B̂1 and Σ̂ε are parameter estimates under the null hypothesis, i.e. the restricted model specification. Here, the superscript (1) indicates that the test is based on the first-order Taylor expansion of the logistic transition function. Similar to Algorithm 5.1, the test procedure consists of the following steps.
Algorithm 11.5: LM^(1)_{T,p}(m)-type test statistic for LVSTAR

(i) Fit a VAR(p) model to {Yt}^T_{t=1} using, e.g., CLS or NLS. Obtain the T × m matrix of residuals Ẽ = (IT − PX)Y, and compute the corresponding sum of squared errors matrix, SSE0 = Ẽ′Ẽ.

(ii) Regress Ẽ = (ε̃1, . . . , ε̃T)′ on (X, U), i.e. an auxiliary regression. Obtain the matrix of residuals Ξ̃ and compute the corresponding sum of squared errors matrix, SSE1 = Ξ̃′Ξ̃.

(iii) Compute the test statistic

  LM^(1)_{T,p}(m) = T tr{SSE0^{-1}(SSE0 − SSE1)}.   (11.71)

Under H0, it is easy to show (Teräsvirta and Yang (2014a) and Exercise 11.3) that

  LM^(1)_{T,p}(m) →^D χ²_{m(mp+1)},  as T → ∞.   (11.72)
The LM^(1)_{T,p}(m)-type test statistic can also be used to help select an appropriate transition variable Xt by computing the statistic for various Xt's and selecting the one for which the p-value of the test statistic is smallest. MC simulation studies (e.g., Teräsvirta and Yang, 2014a) indicate that the power of the above test is good when the transition variable is correctly specified.
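A compact R sketch of steps (i) – (iii) reads as follows, assuming a single candidate transition variable supplied as the vector x; the function name and design are our own, and the auxiliary regressors are the first-order Taylor products of the VAR regressors with x.

  lm_lvstar <- function(y, p, x) {
    Tn <- nrow(y); m <- ncol(y); idx <- (p + 1):Tn; n <- length(idx)
    Z <- cbind(1, do.call(cbind, lapply(1:p, function(j) y[idx - j, , drop = FALSE])))
    U <- Z * x[idx]                          # first-order Taylor auxiliary terms
    E <- resid(lm(y[idx, ] ~ Z - 1))         # step (i)
    SSE0 <- crossprod(E)
    Xi <- resid(lm(E ~ cbind(Z, U) - 1))     # step (ii)
    SSE1 <- crossprod(Xi)
    LM <- n * sum(diag(solve(SSE0) %*% (SSE0 - SSE1)))   # (11.71)
    c(LM = LM, p.value = pchisq(LM, df = m * ncol(U), lower.tail = FALSE))
  }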
In small samples, it is recommended to compute an F -version of the test statistic
(11.71) to improve its empirical size. Also, Bartlett and Bartlett-type corrections
have been suggested. One is the so-called Laitinen–Meisner correction, which is a
simple degrees of freedom rescaling of an LM-type test statistic. Within the setup
Table 11.2: Values of the multivariate nonlinearity test statistics for the tree ring widths data set; T = 66, p = 4, and m = 2 (p-values are given in parentheses).

d
1    21.587 (0.251)   39.263 (0.006)   1.384 (0.153)   40.565 (0.004)   2.297 (0.005)
2    13.758 (0.745)   35.613 (0.017)   1.255 (0.232)   33.969 (0.026)   1.852 (0.028)
3    34.222 (0.012)   37.997 (0.009)   1.339 (0.178)   36.078 (0.015)   1.991 (0.016)
4    23.655 (0.167)   34.876 (0.021)   1.229 (0.252)   34.134 (0.025)   1.863 (0.026)
5    19.315 (0.373)   34.622 (0.022)   1.220 (0.258)   34.138 (0.025)   1.863 (0.026)
four test procedures considered here. The test results are summarized in Table 11.2, columns 3 – 6. We see that the test statistics attain their largest value for delay d = 1. Except for F^(1)_{T,p}(m), the p-values of the test statistics are all close to zero. Thus, the H0 of linearity is rejected against the alternative of LVSTAR nonlinearity.
where Σ̂^(i)_ε is an estimate of the residual covariance matrix in each regime i (i = 1, . . . , k).
Clearly, the explosion of parameters for VSETARMA models can be problem-
atic in practice. Therefore, one often restricts the number of regimes k to a small
number such as 2 or 3 to keep the analysis manageable. In addition, it is useful
to divide the available multivariate data set into subsets according to the empirical
percentiles of {Xt }Tt=1 , and adopt a vector time-domain nonlinearity test statistic to
detect any model change within each subset. This approach may also provide some
tentative information on the location of the threshold intervals R(i) (i = 1, . . . , k).
Moreover, in the case of VSETAR model identification, a regression subset selection
method based on GAs may be considered as an attractive, and easily implemented,
alternative; see Baragona and Cucina (2013).
Evidently, for an m-dimensional VSTAR(k; p, . . . , p) model with a single transition variable Xt−d, we can use both AIC and BIC. Then the regime-specific number of observations is not necessarily an integer, i.e. Ti = Σ_{t=h+1}^{T} (G^(i−1)_{t−d} − G^(i)_{t−d}), where G^(i)_{t−d} ≡ G(Xt−d; γ^(i), c^(i)) is the transition function corresponding to the ith regime, with G^(0)_{t−d} = 1 and G^(k)_{t−d} = 0.
Asymptotics
Let {Yt , t ∈ Z} be a stationary and ergodic m-dimensional stochastic process defined
by the nonlinear model
where F_{t−1} represents the information set generated by {Ys, s < t}, g(·; θ0) is a known measurable Rm-valued function, and θ0 denotes the true, but unknown, value of the K × 1 parameter vector θ. The vector function g(·; ·) is supposed to have continuous second-order derivatives with respect to θ a.s. The process {εt} is an m-dimensional vector martingale difference sequence satisfying (11.60).
Let {Yt}^T_{t=1} be a finite set of realizations of the process {Yt, t ∈ Z}. Given a vector of initial values, the CLS estimator θ̂T of θ0 is obtained by minimizing the sum of squared errors

  LT(θ) = Σ_{t=1}^{T} {Yt − g(F_{t−1}; θ)}′{Yt − g(F_{t−1}; θ)},

where

  H(θ0) = E[ (∂g′_{t−1}/∂θ) Σε^{-1} (∂g_{t−1}/∂θ′) ],
  I(θ0) = E[ (∂g′_{t−1}/∂θ) Σε^{-1} (Yt − g_{t−1})(Yt − g_{t−1})′ Σε^{-1} (∂g_{t−1}/∂θ′) ],
with g_{t−1} ≡ g(F_{t−1}; θ0). Let Γε(ℓ) = Cov(εt, εt−ℓ) be the lag ℓ theoretical autocovariance matrix. Its sample analogue is defined as

  Cε(ℓ) = T^{-1} Σ_{t=ℓ+1}^{T} εt ε′_{t−ℓ},  (ℓ = 0, 1, . . . , T − 1),   (11.79)

and Cε(ℓ) = C′ε(−ℓ) for ℓ = −1, . . . , −T + 1. Stacking the first M sample autocovariances in the Mm² × 1 vector cε = (vec{Cε(1)}′, . . . , vec{Cε(M)}′)′, it holds that

  √T cε →^D N_{Mm²}(0, ΔM),

where ΔM denotes the asymptotic covariance matrix. From standard matrix differentiation, and the martingale difference property of {εt}, it follows that ∂cε(ℓ)/∂θ′ →^P −Jℓ, where Jℓ = E(εt−ℓ ⊗ ∂g_{t−1}/∂θ′) (ℓ = 1, . . . , M) is an m² × K matrix.
Now, consider the case that the parameters of the model are estimated by CLS. Let ĝ_{t−1} ≡ g(F_{t−1}; θ̂T). Denote the m × 1 vector of estimated residuals by ε̂t = Yt − ĝ_{t−1}. Then, replacing εt by ε̂t in (11.79), the Mm² × 1 vector of residual autocovariances ĉε is defined naturally. By expanding ĉε in a Taylor series expansion, it is easy to see that ĉε = cε − J(θ̂T − θ0) + op(T^{-1/2}), where J = (J′1, . . . , J′M)′ is an Mm² × K matrix. Furthermore, it can be shown (Tjøstheim, 1986b, Thm. 2.2) that the asymptotic distribution of T^{-1/2}∂LT(θ0)/∂θ is normal. Also, using the martingale difference property of {εt} it can be proved (Chabot-Hallé and Duchesne, 2008) that T^{1/2}((θ̂T − θ0)′, c′ε)′ converges in distribution to a Gaussian random vector. Combining these results, it follows that, as T → ∞,

  √T ĉε →^D N_{Mm²}(0, Ω),   (11.81)

where

  Ω = ΔM − J H^{-1}(θ0) J*′ − J* H^{-1}(θ0) J′ + J H^{-1}(θ0) I(θ0) H^{-1}(θ0) J′,

and J* = (J*′1, . . . , J*′M)′, with J*ℓ = E(εt−ℓ ⊗ εtε′t Σε^{-1} ∂g_{t−1}/∂θ′), (ℓ = 1, . . . , M). If {εt} is a strict WN process, it follows that J* = J and H(θ0) = I(θ0). This implies that √T ĉε converges to a Gaussian random vector with mean 0 and covariance matrix IM ⊗ Σε ⊗ Σε − J H^{-1}(θ0) J′; see Hosking (1980).
  Q(M) = T ĉ*′ε Ω̂^{-1} ĉ*ε,  (M ∈ Z+),   (11.83)

  Q(ℓ) = {T²/(T − ℓ)} ĉ′ε(ℓ) Ω̂^{-1} ĉε(ℓ),  (ℓ = 1, . . . , M),   (11.84)

where

  Ω̂ = Δ̂M − Ĵ Ĥ^{-1}_T Ĵ*′ − Ĵ* Ĥ^{-1}_T Ĵ′ + Ĵ Ĥ^{-1}_T ÎT Ĥ^{-1}_T Ĵ′.

Next, in the context of diagnostic tests based on multivariate quantile residuals, write the conditional density function of Yt as the product

  f_{t−1}(Yt; θ) = Π_{j=1}^{m} f_{ij,j−1,t−1}(Y_{ij,t}; θ | Aj−1),   (11.85)
Table 11.3: Three diagnostic test statistics based on multivariate quantile residuals.

Autocorrelation. H0: E(R_{t,θ0} R′_{t−ℓ,θ0}) = 0_{m×m}, ∀t (ℓ = 1, . . . , K1; K1 ≪ T). Transformation function g: R^{m(K1+1)} → R^{m²K1} with g(u_{t,θ}) = vec(r_{t,θ} r′_{t+1,θ}, . . . , r_{t,θ} r′_{t+K1,θ}). Test statistic: Â_{T,K1} = S_{T,d} with d = K1 + 1.

Heteroskedasticity. H0: Cov(R²_{i,t,θ0}, R²_{j,t−ℓ,θ0}) = 0, ∀t and ∀i, j ∈ {1, . . . , m} (ℓ = 1, . . . , K2; K2 ≪ T). Transformation function g: R^{m(K2+1)} → R^{m²K2} with g(u_{t,θ}) = vec(v_{t,θ} v′_{t+1,θ}, . . . , v_{t,θ} v′_{t+K2,θ}), where v_{t,θ} = (r²_{1,t,θ} − 1, . . . , r²_{m,t,θ} − 1)′. Test statistic: Ĥ_{T,K2} = S_{T,d} with d = K2 + 1.

Normality. H0: E(R²_{j,t,θ0} − 1, R³_{j,t,θ0}, R⁴_{j,t,θ0} − 3)′ = 0, ∀t and ∀j ∈ {1, . . . , m}. Transformation function g: R^m → R^{3m} with g(r_{j,t,θ}) = (r²_{j,t,θ} − 1, r³_{j,t,θ}, r⁴_{j,t,θ} − 3)′. Test statistic: N̂T = S_{T,d} with d = 1.
where Aj−1 = σ(Y_{i1,t}, . . . , Y_{i_{j−1},t}) is the σ-algebra generated by the first j − 1 component variables. Interpret f_{i1,0,t−1}(Y_{i1,t}; θ) = f_{i1,t−1}(Y_{i1,t}; θ), and F_{ij,j−1,t−1}(Y_{ij,t}; θ) = ∫_{−∞}^{Y_{ij,t}} f_{ij,j−1,t−1}(u; θ) du. Thus, generalizing (6.85), the m × 1 vector of theoretical quantile residuals at time point t is defined by

  R_{t,θ} = (R_{1,t,θ}, . . . , R_{m,t,θ})′ = ( Φ^{-1}{F_{i1,t−1}(Y_{i1,t}; θ)}, . . . , Φ^{-1}{F_{im,m−1,t−1}(Y_{im,t}; θ)} )′,   (11.86)

with U_{t,θ0} = (R′_{t,θ0}, . . . , R′_{t−d+1,θ0})′ ∈ R^{dm} and with d given in Table 11.3. Conditional on a vector with initial values, and assuming the conditional density function f_{t−1}(Yt; θ) exists, the log-likelihood function ℓT(y, θ) = Σ_{t=1}^{T} ℓt(Yt, θ) = Σ_{t=1}^{T} log f_{t−1}(Yt; θ) of the set of observations {Yt}^T_{t=1} follows directly. Then, under some mild conditions, Kalliovirta and Saikkonen (2010) prove a CLT for transformed vector quantile residuals. Next, they define the general test statistic
  S_{T,d} = (1/(T − d + 1)) { Σ_{t=1}^{T−d+1} g(u_{t,θ̂T}) }′ Ω̂T^{-1} { Σ_{t=1}^{T−d+1} g(u_{t,θ̂T}) },   (11.87)

where u_{t,θ̂T} = (r̂′_{t,θ̂T}, . . . , r̂′_{t−d+1,θ̂T})′, and Ω̂T is a consistent estimator of the asymptotic covariance matrix Ω. Specifically,

  Ω̂T = ĜT ÎT^{-1} Ĝ′T + Ψ̂T ÎT^{-1} Ĝ′T + ĜT ÎT^{-1} Ψ̂′T + ĤT,   (11.88)
where ĜT = T^{-1} Σ_{t=1}^{T} ∂g(u_{t,θ̂T})/∂θ′, Ψ̂T = T^{-1} Σ_{t=1}^{T} g(u_{t,θ̂T}) ∂ℓt(Yt, θ̂T)/∂θ′, ĤT = T^{-1} Σ_{t=1}^{T} g(u_{t,θ̂T}) g(u_{t,θ̂T})′, and ÎT is a consistent estimator of I(θ0), the expected information matrix evaluated at θ0. In practice, one can compute these matrices by simulation. Moreover, given the above null hypotheses, explicit expressions for H = E[g(U_{t,θ0}) g(U_{t,θ0})′] follow in a straightforward way.
Assume that the vector nonlinear model under study is correctly specified. Then (11.87) has an asymptotic χ²_n distribution, with n the dimension of g (Kalliovirta and Saikkonen, 2010). This result does not depend on the chosen order of conditioning of R_{t,θ}. Table 11.3 shows three diagnostic test statistics, as special cases of (11.87). Under H0, these test statistics are asymptotically distributed as respectively χ²_{m²K1}, χ²_{m²K2}, and χ²_{3m}.
11.7 Forecasting
where ηt and g*(·) are defined in a similar way as εt and g(·) respectively, and F(·) is the joint distribution function of the dependent processes {ηt} and {εt}. Thus, just as in the univariate case, one can only obtain forecasts by numerical methods. Two common approaches to computing multi-step ahead forecasts are MC simulation and BS. Often, a BS procedure is preferred in practice since no assumptions need to be made about the distribution of {εt}. One option is to use some form of block bootstrapping by resampling from non-overlapping blocks of consecutive centered residuals, say {ε̃t}. Another option is to use a model-based bootstrap. By this it is meant that a finite-order VAR model is first fitted to {ε̂t}, assuming that the vector error process is i.i.d. and its components are mutually uncorrelated. Then, assuming that the VAR residuals are i.i.d., and using the recursive structure of the VAR model, it is straightforward to obtain the H-step ahead forecast E(Yt+H|F_t) via block bootstrapping.
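A schematic residual-based bootstrap in R might look as follows. Here predict_one() is a hypothetical helper returning the fitted one-step conditional mean of the estimated nonlinear model given the history so far, and E is a matrix of centered model residuals (one row per time point); both are stand-ins, not functions from a particular package.

  bs_forecast <- function(y_hist, E, H, B = 1000, predict_one) {
    m <- ncol(E)
    paths <- array(0, c(B, H, m))
    for (b in 1:B) {
      yb <- y_hist
      for (h in 1:H) {
        e <- E[sample(nrow(E), 1), ]       # resample a residual (block size one)
        ynew <- predict_one(yb) + e        # propagate through the fitted model
        yb <- rbind(yb, ynew)
        paths[b, h, ] <- ynew
      }
    }
    apply(paths, c(2, 3), mean)            # H x m matrix of forecast estimates
  }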
where α⊥φ*0 = Φ0 and α⊥β′ = Φ1, with φ*0 a scalar parameter, β a 2 × 1 parameter vector, α′α⊥ = 0, G(·) a logistic transition function given by (11.37), and {εt} ~ i.i.d. (0, Σε) independent of Yt, with Xt ≡ Yt−d (d > 0).
For further evaluation of (11.93) we need to distinguish between the two cases d < 2 and d ≥ 2. When d = 1, explicit expressions for E[G(Yt+2−d; γ, c)] and E[εt+1 G(Yt+2−d; γ, c)] are not directly available; then we need to replace them by estimates obtained via MC simulation or BS. However, when d ≥ 2, we see that G(Yt+2−d; γ, c) is available at time t. In this case (11.93) reduces to

  Ŷ^LS_{t+2|t} = E(Yt+2|F_t) = Φ0 + Φ1(Φ0 + Φ1 Yt) + Φ1 α⊥(φ*0 + β′Yt) G(Yt+1−d; γ, c)
      + {φ*0 α⊥ + α⊥β′Φ0 + α⊥β′Φ1 Yt} G(Yt+2−d; γ, c)
      + α⊥β′α⊥(φ*0 + β′Yt) G(Yt+1−d; γ, c) G(Yt+2−d; γ, c).   (11.94)
In general, when H ≤ d, exact analytic expressions for Ŷ^LS_{t+H|t} can be obtained. However, when H > d, one has to resort to MC or BS methods. For instance, in the case of block bootstrapping with a block size of one, E[εt+1 G(Yt+2−d; γ, c)] can be estimated by

  ( B^{-1} Σ_{b=1}^{B} ε̂^(b)_{1,t+1} G(Y^(b)_{t+2−d}; γ, c),  B^{-1} Σ_{b=1}^{B} ε̂^(b)_{2,t+1} G(Y^(b)_{t+2−d}; γ, c) )′,

and E[G(Yt+2−d; γ, c)] by B^{-1} Σ_{b=1}^{B} G(Y^(b)_{t+2−d}; γ, c), with B the number of BS replicates. The steps to obtain the 2 × 1 vector ε̂^(b)_{t+1} = (ε̂^(b)_{1,t+1}, ε̂^(b)_{2,t+1})′ are as follows.
(i) Compute the bias-corrected residuals ε̃t = ε̂t − ε̄, where ε̄ is the sample mean of the "raw" residuals {ε̂t}.

(ii) Obtain the bootstrap residuals ε̃^(b)_t as random draws with replacement from {ε̃t}, taking account of serial correlation in {ε̃t} via the Cholesky form of the sample estimate of Σε. Next, compute ε̂^(b)_{t+1} as ε̂_{t+1} + ε̃^(b)_t.

Alternatively, one can use a fixed block size which depends on the forecast horizon H, or a random block size when the errors are serially correlated.
Generalized MSFE

One problem with using (11.95) is that E(et+h e′t+h) is not invariant to non-singular, scale-preserving transformations. Hence, different models may yield the most accurate forecasts for different transformations. To avoid this problem, Clements and Hendry (1993) propose the so-called generalized forecast error second moment (GFESM). Let ẽn+h = (e′_{n+h|n}, e′_{n+h+1|n+1}, . . . , e′_{n+h+(T−H−n)|n+(T−H−n)})′ be the vector of h-step ahead forecast errors. Then the GFESM is defined as the determinant of the matrix E(Ẽh Ẽ′h), where Ẽh = (ẽn+1, ẽn+2, . . . , ẽn+h)′. An estimate of this criterion is given by

  GFESM_R(h) = (1/(hR)) |Ẽh Ẽ′h|,  (h = 1, . . . , H),   (11.96)
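Given a matrix of stacked forecast errors, (11.96) is a one-liner in R. The sketch below assumes Eh is oriented so that crossprod(Eh) yields the matrix whose determinant (11.96) requires, with one row per forecast origin; the helper name is ours.

  gfesm <- function(Eh, h) {
    R <- nrow(Eh)                  # number of forecast origins
    det(crossprod(Eh)) / (h * R)   # (1/(hR)) |Eh' Eh|, cf. (11.96)
  }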
Forecast densities

Multivariate forecast densities can be evaluated in the same fashion as discussed in Section 10.4.3. For instance, suppose we have a series of T − n one-step ahead forecasts of a bivariate time series Yt = (Y1,t, Y2,t)′ obtained via the rolling forecasting scheme just described. Let ft(Y1,t, Y2,t|F_{t−1}) (t = 1, . . . , T − n) denote the joint forecast density with f1(Y1,1, Y2,1|F_0) ≡ f(y1, y2). Further, suppose this density function can be factorized into the product of the conditional (c) density and the marginal (m) density as, e.g., ft(Y1,t, Y2,t|F_{t−1}) = f^(c)_t(Y1,t|Y2,t, F_{t−1}) × f^(m)_t(Y2,t|F_{t−1}). We can transform each element (Y1,t, Y2,t) by its corresponding PIT to give

  U^(c)_{1|2,t} = ∫_{−∞}^{Ŷ^(c)_{1|2,t+1}} f^(c)_t(u|Y2,t, F_{t−1}) du,   U^(m)_{2,t} = ∫_{−∞}^{Ŷ^(m)_{2,t+1}} f^(m)_t(u|F_{t−1}) du,
  (t = 1, . . . , T − n),   (11.97)
where Ŷ^(c)_{1|2,t+1} and Ŷ^(m)_{2,t+1} are respectively the conditional and marginal one-step ahead forecasts. The null hypothesis of interest is that the model forecast density corresponds to the true conditional density; that is, under H0 the density ft(Y1,t, Y2,t|F_{t−1}) coincides with the true joint forecast density. Then the two sequences {U^(c)_{1|2,t}}^{T−n}_{t=1} and {U^(m)_{2,t}}^{T−n}_{t=1} will each be i.i.d. U(0, 1); Rosenblatt (1952). Moreover, the two sequences of PITs will themselves be independent.
Figure 11.6: Time plots of flow (m3 /s) of (a) Jökulsá Eystri river and (b) Vatnsdalsá
river, Iceland, (c) precipitation (mm), and (d) temperature (◦ C). Daily data covering the
time period January 1972 – December 1974; T = 1,095.
be applied for the usual (H − 1) dependence of the forecasts. That is, divide the
forecasts into sets of independent series, taking the first, the H + 1, the 2H + 1 etc.
for set 1, and the second, the H + 2, the 2H + 2 etc. for the second set, and so on.
Thus, each of the sub-series of PITs {U1 , U1+H , U1+2H , . . .}, {U2 , U2+H , U2+2H , . . .},
and {UH , U2H , U3H , . . .} should be i.i.d. U (0, 1) under H0 .
river (Q2,t), also located in north-west Iceland. Jökulsá Eystri is the bigger river of the two, with a large drainage basin (1,200 km²) that includes a glacier (155 km²); as a result, the effect of temperature goes beyond producing spring snowmelt. Vatnsdalsá has a much smaller drainage area (450 km²), and some of the flow is due to groundwater. A full description of this streamflow system is available in Tong et al. (1985) and the references cited there.

Figure 11.6 shows time plots of the four variables. We see sharp rises and slow declines, with a more pronounced spring peak in the Vatnsdalsá flow than in the Jökulsá Eystri flow data due to the presence of the glacier in the latter's drainage area. Since the recorded values of Pt represent the rain or snow accumulated at 9 a.m. since the same time the day before, we adjust the series Pt by a forward translation of one day. In total there are 1,095 observations for analysis.
VTARX model

Following Tsay (1998), we use Tt as a threshold variable for both flows. Furthermore, we focus on a two-regime model. Initially, the maximum AR order of Qi,t (i = 1, 2) and the maximum order of the exogenous variables Pt and Tt were set at 15 and 3, respectively. After some fine tuning, using the multivariate F test statistic of Section 11.4 and AIC, we arrive at the final equations for the bivariate two-regime VTARX model reported in Table 11.4, with AIC = 16,981.7 and BIC = 17,355.0.⁵ The corresponding threshold parameter estimate is given by r̂ = −0.409◦C. The numbers of data points in the two regimes are 479 and 601, respectively.
Some observations are in order. First, the estimate of the threshold parameter for Tt is slightly below freezing, which effectively separates the histories of Q1,t and Q2,t into two regimes. However, only for Tt > −0.409◦C (regime 2) does the series Q1,t depend strongly on current and previous-day temperature. This phenomenon may be explained by the presence of the glacier in the basin. There is no effect of temperature on Qi,t (i = 1, 2) in the other three equations. Second, lagged precipitation has an effect on current flow for both series. The lags and amount, however, depend on Tt.
5
The parameter estimates are not completely identical to those reported by Tsay (1998). This
may be due to small differences in computer code.
Figure 11.7: HDR’s based on 50% (grey) and 90% (blue) coverage probabilities for the
GIRF of the VTARX model for a one-unit, system-wide shock; (a) Jökulsá Eystri river, Tt ≤
−0.409◦ C, (b) Jökulsá Eystri river, Tt > −0.409◦ C, (c) Vatnsdalsá river, Tt ≤ −0.409◦ C,
and (d) Vatnsdalsá river, Tt > −0.409◦ C.
Table 11.4: CLS estimates of a bivariate VTARX model for the Iceland river flow data
set; T = 1,095. Blue-typed numbers denote significant parameter values at the 5% nominal
significance level.
Table 11.5: Icelandic river flow data set. Indicator pattern of the statistically significant
values of the residual sample cross-correlation matrices for the {Q1,t } and {Q2,t } time series.
Lag    1    2    3    4    5    6    7    8    9    10
      • •  • •  + •  + •  + •  • •  • •  • •  • •  + •
      + +  • •  • •  • •  • •  • •  • •  + •  • •  • •
much of it taken from relatively recent reports and papers. Certainly, and despite
various advantages of vector nonlinear methods over corresponding linear methods,
we should mention that these methods are not free of caveats. For instance, if the
multivariate nonlinear DGP is a “long way” from linearity (null hypothesis) due to
outliers in the series, it is likely that asymptotic test theory will not work well. In
that case, one would expect to reject the null hypothesis emphatically – with a large
number of candidate models under the alternative hypothesis. Moreover, outliers
can have a more serious effect on multivariate nonlinear conditional mean forecasts
than on univariate forecasts due to complex interactions among simultaneously ac-
quired time series. To some extent, these and other difficulties may be overcome by
adopting the vector semi- and nonparametric methods/models discussed in Chapter
12. In any case, we have seen that vector parametric nonlinear time series analysis
can be useful in giving insight into the interdependence between many time series
met in practice. With an interplay between theory and practice, further research
will no doubt result in a “nonlinearity toolkit” for vector time series.
likelihood function. Also, the repeated residual method of Subba Rao and Gabr (1984) may
be adopted for the estimation of vector BL models. Within the frequency-domain, Subba
Rao and Wong (1999) propose an extension of the method described by Sesay and Subba
Rao (1992). Kumar (1988) investigates some moment properties of bivariate BL models.
Section 11.2.2: Nieto (2005) proposes a methodology for analyzing bivariate time series
with missing data using a VSETAR model transformed into a state space form with regime
switching. The identification and estimation of the model is based on a combination of
MCMC and Bayesian approaches.
There is a wealth of literature applying VSETARs to empirical (financial) economic data.
Three interesting publications outside the area of economics are: Bacigál (2004) (bivariate
GPS data), Chan et al. (2004) (trivariate actuarial data), and Solari and Van Gelder (2011)
(five-variate sea wave and wind data).
Section 11.2.3: Yi and Deng (1994) present sufficient conditions for geometric ergodicity
of a first-order bivariate VSETAR model with two partitions in each regime. They assume
that the structural parameters of a bivariate VSETAR model with multivariate regimes are
unknown and jointly estimated with the other parameters of the model.
Section 11.2.4: Yang et al. (2007) suggest a hybrid algorithm for the estimation of TVEC
models which combines aspects of GAs and elements of simulated annealing (SA). Simulation
results show that the algorithm does a better job than either SA or GA alone.
Hansen and Seo (2002) propose a SupLM-type test statistic for testing a linear VEC model
against a two-regime TVEC model; see the function TVECM.HStest in the R-tsDyn package.
However, this test can suffer from substantial power loss (see, e.g., Pippenger and Goering,
2000 and Seo, 2006) when the alternative hypothesis is threshold cointegration. As an
alternative, Seo (2006) adopts a SupWald-type test statistic, and derives its asymptotic
null distribution. The power of the proposed test dominates the power of conventional
cointegration tests.
Section 11.2.5: Many extensions of the VSTAR models have been proposed in the liter-
ature; see Hubrich and Teräsvirta (2013) for a survey. For instance, Dueker et al. (2011)
propose a so-called vector contemporaneous-threshold STAR model. A key characteristic
of the model is that regime weights depend on the ex-ante probabilities that latent regime-
specific variables exceed certain threshold values. Several methods are available to find
good starting-values for the estimation of VSTAR models. In an MC simulation study,
Schleer–van Gellecom (2015) compares grid search algorithms and three heuristic proced-
ures: differential evolution (DE), threshold accepting (TA), and simulated annealing (SA).
It appears that SA and DE improve LVSTAR model estimation.
Section 11.3: Harvill and Ray (1998) compare the various nonlinearity test statistics in
an MC simulation study. Their results indicate that the power of the test statistics is
affected by cross-correlation between process errors terms. In general, the multivariate
test statistics tend to perform better than their univariate counterparts when the cross-
correlation is moderate or weak. For small sample sizes, the multivariate version of the
Tukey nonadditivity-type test statistic is preferable, as the test requires fewer degrees of
freedom.
Section 11.4: Li and He (2012a) develop an F-type test statistic to examine linear versus nonlinear cointegration in a bivariate LVSTAR model. In case the null hypothesis is rejected, they recommend examining the time series for CNFs using an LM-type test statistic as proposed by Li and He (2012b). Within this context, Li and He (2013) propose a residual-based Wald-type test statistic for CNFs in LVSTAR models.
As noted earlier, tests for nonlinearity can be quite sensitive to extreme outliers. This is, for
instance, the case with the multivariate test statistic in Algorithm 11.4. Chan et al. (2015)
propose a new and robust VSETAR-nonlinearity test statistic, and derive its asymptotic
null distribution.
There are many ways in which an estimated nonlinear vector model can be misspecified.
Yang (2012) and Teräsvirta and Yang (2014b) consider three LM-type misspecification test
statistics for possible VSTAR model extensions: a test of no serial correlation, a test of no
additive nonlinearity, and a test for parameter constancy.
Section 11.5: Billings et al. (1989) propose a method for variable selection in general (in-
cluding exogenous variables) nonlinear models based on a truncated multivariate, discrete-
time, Volterra series representation; see also Billings (2013). The method uses a recursive
orthogonal LS algorithm which efficiently combines model identification and parameter es-
timation. It can be tied to the subset model selection method for univariate nonlinear time
series models of Rech et al. (2001); see Section 12.7 for details about the method in the mul-
tivariate case. Camacho (2004) presents a strategy for building (specification, estimation,
and evaluation) bivariate STAR models; see Yang (2012) for the multivariate case.
Section 11.6: Ling and Li (1997) and Duchesne (2004), among others, present diagnostic
test statistics for checking multivariate (G)ARCH errors.
Section 11.7: Using BS and MC simulation procedures, De Gooijer and Vidiella-i-Anguera
(2003b) explore the long-term forecast ability of two threshold vector cointegrated systems
via a rolling forecasting approach. For model comparison they apply several forecast accur-
acy measures, including forecast densities. Polanski and Stoja (2012) propose a test statistic
for evaluating multi-dimensional time-varying density forecasts.
The KS test statistic of uniformity, and related GOF tests, are sometimes referred to as
omnibus tests, i.e. they are sensitive to almost all alternatives to the null hypothesis. For
evaluating forecast densities, this property implies that when an omnibus test fails to reject
H0 , we can conclude that there is not enough evidence that the time series is not generated
from the joint forecasting density. On the other hand, a rejection would not provide any
information about the form of the density. Test statistics that can be decomposed into
interpretable components may be a solution. Such a test is Neyman’s smooth test for testing
uniformity. De Gooijer (2007) explores the properties of this test statistic in a bivariate VAR
framework. Moreover, he applies the test to multivariate forecast densities obtained from
the VSETAR model in Exercise 11.5 fitted to the S&P 500 stock index data.
Section 11.8: Teräsvirta and Yang (2014b) present another study of the Icelandic river
flow data, using a VLSTAR model with a yearly sine and cosine term as input variable.
Table 11.6: Asymptotic critical values of the LR_{T,p}(m, r0) test statistic (11.65) for various bivariate VTAR models of order p; λ = (1 − r0)²/r0².
Software References
Section 11.2.2: MATHEMATICA source code for testing and estimating bivariate TAR models can be downloaded from Tomáš Bacigál's web page at https://www.math.sk/bacigal/homepage/. The R-tsDyn package contains various functions for bivariate TVAR estimation, simulation and linearity testing.
Section 11.3: The test results in Table 11.1 are computed with applytot.f, a FORTRAN77
program written by Jane L. Harvill and Bonnie K. Ray, and available at the website of this
book.
Section 11.4: Yang (2012, Appendix) provides a collection of R functions for the specifica-
tion and evaluation of VSTAR models; see http://pure.au.dk/portal/files/45638557/
Yukai_Yang_PhD_Thesis.pdf.
Application: Several FORTRAN77 programs for threshold estimation and parameter es-
timation of VTARX models (three regimes at most), created by Ruey S. Tsay, are available
at the website of this book.
Appendix
(iii) Draw randomly (with replacement) a sequence of vector residuals from this
set, i.e. {e∗t , . . . , e∗t+H }, where H (H ≥ 1) is the forecast horizon.
(iv) Suppose that the effect of a shock on the ith variable Yi,t (i = 1, . . . , m) is of interest given the initial history ωi,t−1 of this variable. Then replace the ith element of e*t by a shock of size e^(δ)_{i,t} = δ drawn from a set of shocks. Alternatively, the shock may be drawn from a restricted set of shocks.

(v) Recover the "original residuals" by the transformations ε*_{t+j} = P e*_{t+j} (j = 1, . . . , H) and ε^(δ)_{i,t} = P e^(δ)_{i,t} (i = 1, . . . , m).

(vi) For each j = 1, . . . , H, and a history ωi,t−1, generate two values of Yi,t+j, one using ε*_{t+j} and one using ε^(δ)_{i,t} (i = 1, . . . , m). Compute the differences, say GIRF^(b)_Y(H, ε^(δ)_{i,t}, ωi,t−1), between both values.

(vii) Repeat steps (iii) – (vi) B times, to obtain {GIRF^(b)_Y(H, ε^(δ)_{i,t}, ωi,t−1)}^B_{b=1} (i = 1, . . . , m). Finally compute, as an estimate of the GIRF (11.98), the sample average GIRF_{Y,i,H} = B^{-1} Σ_{b=1}^{B} GIRF^(b)_Y(H, ε^(δ)_{i,t}, ωi,t−1) for each variable i.
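As an illustration of steps (iii) – (vii), consider the following R sketch. Here simulate_path() is a hypothetical helper that propagates a sequence of residual vectors through the fitted model from a given history, E holds the orthogonalized residuals from step (ii), and P is the Cholesky factor used there; all names are our own stand-ins.

  girf <- function(history, E, P, i, delta, H, B = 1000, simulate_path) {
    out <- matrix(0, B, H)
    for (b in 1:B) {
      e_star <- E[sample(nrow(E), H, replace = TRUE), , drop = FALSE]  # step (iii)
      e_shock <- e_star
      e_shock[1, i] <- delta                       # step (iv): shock ith element
      eps0 <- t(P %*% t(e_star))                   # step (v): recover residuals
      eps1 <- t(P %*% t(e_shock))
      y0 <- simulate_path(history, eps0, H)        # step (vi): two paths
      y1 <- simulate_path(history, eps1, H)
      out[b, ] <- y1[, i] - y0[, i]
    }
    colMeans(out)                                  # step (vii): average over B
  }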
Repeating steps (i) – (vi) a sufficiently large number of times (say R), an estimate of the unconditional pdf of the random GIRF, given ωi,t−1, follows directly. So, each time a new set of histories is drawn from a given initial set of histories. If the size of the shock e^(δ)_{i,t} and/or the subset of histories is restricted, a conditional estimate of the pdf can be obtained. In the application of Section 11.8, we set H = 10, B = 1,000 and R = 1,000. Finally, it is good to mention that for unrestricted linear VAR and cointegrated VAR models the computation of GIRFs does not require orthogonalization of shocks, as in step (ii) above, and the GIRFs are invariant to the ordering of the variables in the VAR; see Pesaran and Shin (1998).
Exercises
Theory Questions
11.1 Given the BL model (11.14), verify condition (11.16).
[Hint: First show that exp(E log ||Θv + Σ_{u=1}^{p} Ψuv[Y′_{t−u} ⊗ Im]||) < 1 (v ∈ {1, . . . , q}; q ≤ p). Next, prove (11.16), using Jensen's inequality, the Cauchy–Schwarz inequality, the strict stationarity of the process (ergodic theorem), and the properties of vectors and matrices given in Appendix 7.A.]
Here, 1 = (1, . . . , 1)′, and R^(i) denotes the support of the associated density function, assuming it exists. Then a multivariate analogue of the univariate asMA(1) model is defined as

  Yt = Xt + Σ_{i=1}^{2} Bi I(Xt−1 ∈ Mi) Xt−1
(c) Using the results in part (b), and assuming the process {Yt , t ∈ Z} is weakly
stationary, show that
11.3 Consider the LM-type test statistic for testing linearity versus the LVSTAR model in
(11.69).
11.4 Let U1 and U2 be two independent random variables, each U(0, 1) distributed.

(a) Show that the random variable U^(p) = U1 × U2 has a distribution function given by F_{U^(p)}(x) = x − x log(x) if 0 < x < 1.

(b) Show that the distribution function of U^(r) = U1/U2 is given by F_{U^(r)}(x) = x/2 if 0 < x < 1, and F_{U^(r)}(x) = 1 − 1/(2x) if 1 < x < ∞.

(c) Show that the distribution function of U^(p) = (U1 − ½)(U2 − ½) is given by

  F_{U^(p)}(x) = −2x log 2 + 2x − 2x log(2x) + ½,  x > 0,
  F_{U^(p)}(x) = −2x log 2 + 2x − 2x log(−2x) + ½,  x < 0.
Empirical Questions
11.5 The V(SE)TAR model is a useful tool to study index futures arbitrage in finance. Tsay
(1998) studies the intraday (1–minute) transactions for the S&P 500 stock index in
May 1993 and its June futures contract traded at the Chicago Mercantile Exchange.
Specifically, let {Yt = (Y1,t, Y2,t, Xt)′}^{7,060}_{t=1} denote the data set under study (file: intraday.dat) with Y1,t = f_{t,ℓ} − f_{t−1,ℓ} and Y2,t = st − st−1, where f_{t,ℓ} is the log price of the index futures at maturity ℓ, and st is the log of the security index cash prices.
(a) Check the threshold nonlinearity of the series {Yt} using the test statistics F^(T)_{T,p}(m) (Algorithm 11.2), F^(O)_{T,p}(m) (Algorithm 11.3), and CT,p(d, m) (Algorithm 11.4). In all cases, assume that a VAR(8) model best describes the interdependencies between the two series.
(b) Using LS, estimate the parameters of the following bivariate VSTARX(2; 8, 8) model

  Yt = φ^(1)_0 + Σ_{u=1}^{8} Φ^(1)_u Yt−u + β1 Xt−1 + ε^(1)_t   if Xt−1 ≤ r,
  Yt = φ^(2)_0 + Σ_{u=1}^{8} Φ^(2)_u Yt−u + β2 Xt−1 + ε^(2)_t   if Xt−1 > r,

where Xt is an exogenous variable (column three of the available data set) controlling the switching dynamics, r is a real number, Φ^(i)_u (i = 1, 2; u = 1, . . . , 8) are 2 × 2 matrices of coefficients, and φ^(i)_0 and βi are 2 × 1 vectors of unknown parameters. The error process {ε^(i)_t} satisfies ε^(i)_t = (Σ^(i)_ε)^{1/2} ηt, where Σ^(i)_ε (i = 1, 2) are 2 × 2 symmetric positive definite matrices, and {ηt} ~ i.i.d. N(0, I2). Provide an (economic) interpretation for the estimation results.
(c) Apply the LVSTAR nonlinearity test statistic LM^(1)_{T,p}(m) (Algorithm 11.5), and the rescaled F^(1)_{T,p}(m) test statistic (Expression (11.73)) to the intraday transaction series, letting p = 8. Compare the test results with those of part (a).
11.6 Consider the monthly percentage growth of personal consumption expenditures, and the percentage growth of personal disposable income in the U.S. for the time period January 1985 – December 2011 (T = 324). Both series are measured in millions of dollars, and months are seasonally adjusted at annual rates. Let {Yi,t} (i = 1, 2) denote the logs of the first differences of the two series. Li and He (2013) use the first 264 observations to estimate an LVSTAR(3)–CNF model, referred to below as model (11.99).
The last 60 observations are set aside for out-of-sample forecasting in a rolling fore-
casting framework. Thus, the first forecast origin is 264. Then h-step ahead forecasts
(h = 1, . . . , H) are obtained with maximum forecast horizon H = 1, 3, and 6. Next,
at time t = 264, the parameters of the model are re-estimated as new observations be-
come available, but the model structure remains unchanged. This process is repeated
until t extends as far as 323.
The aim of this exercise is to compare the out-of-sample forecasting performance
of (11.99) with forecasts obtained from a VAR(3) model fitted to the series {Yi,t }
(i = 1, 2).
(a) The file con inc.dat contains the original, untransformed data. Obtain H-step
forecasts (with H = 1, 3 and 6) from a VAR(3) model in a similar manner to
the rolling forecast experiment described above.
Collect the corresponding three series of forecast errors in appropriately named data
files. The data files eNL1.dat (T = 60), eNL3.dat (T = 176), and eNL6.dat (T = 335)
contain the H-step ahead forecast errors (H = 1, 3, and 6) from the LVSTAR(3)–CNF
model.
Quite often it is not possible to postulate an appropriate parametric form for the
DGP under study. In such cases, semi- and nonparametric methods are called for.
Certain of these methods introduced in Chapter 9 can be easily extended to the
multivariate (vector) framework. Specifically, let Yt = (Y1,t , . . . , Ym,t ) denote an
m-dimensional process. We consider again the general nonlinear VAR(p) model
on this test statistic and hence it can serve as an initial way to infer causal rela-
tionships. In Section 12.5, we then introduce three formal nonlinear causality test
statistics. These tests are closely related to test statistics for high-dimensional serial
independence, which we discussed earlier in Chapter 7.
Two appendices are added to the chapter. Appendix 12.A provides information about the numerical computation of multivariate conditional quantiles. Appendix 12.B discusses how to compute percentiles of the vector-based analogue of the univariate test statistic R̂Y(·) introduced in Section 1.3.3.
  φ(θ, x) = E( ||Y − θ||q − ||Y||q | X = x ) = ∫_{R^m} ( ||y − θ||q − ||y||q ) Q(dy|x),   (12.2)

  φT(θ, x) = ∫_{R^m} ( ||y − θ||q − ||y||q ) QT(dy|x)
           = Σ_{t=1}^{T} ( ||Yt − θ||q − ||Yt||q ) Kh(x − Xt) / Σ_{t=1}^{T} Kh(x − Xt),
and the estimator is consistent (De Gooijer et al., 2006). In Appendix 12.A, we
discuss the computation of (12.4).
The innovations satisfy εt = Σε^{1/2} ηt, where Σε^{1/2} = diag(0.2, 0.2) is a symmetric positive definite matrix, and {ηt} is a sequence of serially uncorrelated bivariate random vectors.
Figure 12.1: True and estimated bivariate conditional quantile functions at q = 0.5 for a
typical MC simulation of the NLAR(1) process (12.5).
  Xt = (Wt, . . . , Wt+p−1)′ and Yt = Wt+H+p−1  (t = 1, . . . , n; n = T − H − p + 1),

  θ̂q,T = arg min_{θ∈R} Σ_{t=1}^{n} ρq(Yt − θ) Kh(X_{T−p+1} − Xt),   (12.6)

where ρq(u) = 0.5{|u| + (2q − 1)u} is the check function. Note that in the univariate case, the series {W1,t} and {W2,t} are considered separately.
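A direct R implementation of this idea, for a single quantile level q and a Gaussian product kernel (our choice; the kernel is left generic above), is:

  cond_quantile <- function(Y, X, x0, q, h) {
    w <- apply(X, 1, function(x) prod(dnorm((x0 - x) / h) / h))  # K_h(x0 - X_t)
    rho <- function(u) 0.5 * (abs(u) + (2 * q - 1) * u)          # check function
    obj <- function(theta) sum(rho(Y - theta) * w)
    optimize(obj, range(Y))$minimum                              # minimizer of (12.6)
  }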
We need some measure to evaluate how well the quantile forecasts from the
two methods are doing. To this end, we use a rolling forecast framework of 800
observations which gives a total of 498 conditional quantiles for each forecast
step. Then, for each q, we calculate the following accuracy measure
1
Härdle et al. (1998) discuss an LL kernel-based method for the estimation of (12.1) in the
multivariate case, allowing for conditional heteroskedasticity of the error process. They use a
longer version of the above bivariate data set.
Figure 12.2: Daily returns (rescaled) of the exchange rates data set; (a) {W1,t =
DEM/USD} and (b) {W2,t = DEM/GBP} for the time period January 3, 1990 – December
28, 1994; T = 1,300.
Table 12.1: Exchange rates data set. Values of the accuracy measure q̂i,H based on 498 out-of-sample forecasts; θ̂q,T is the multivariate conditional quantile estimator, and θ̃q,T is the univariate conditional quantile estimator. Blue-typed numbers indicate values which are statistically significantly different from q. From De Gooijer et al. (2006).
  q̂i,H = (1/498) Σ_{j=1}^{498} I(Wi,T+H+j−1 ≤ θ̂^(H)_{q,T}),  (i = 1, 2; H = 1, 2, 3; T = 800),

where θ̂^(H)_{q,T} is either θ̂q,T (multivariate) or θ̃q,T (univariate), with the superscript (H) denoting the H-step ahead prediction. If the conditional quantiles are accurate, we expect the value of q̂i,H to closely approximate q. Table 12.1 shows the results for q̂i,H. The results of the significance test are obtained using the Gaussian assumption and the well-known fact that the standard deviation for a set of n = 498 proportions equals (q(1 − q)/n)^{1/2}.
Given their role in Value at Risk calculations, a type of risk measure in a financial market (see, e.g., Tsay, 2010), we only discuss the conditional quantile results for the lower tail quantile levels, q = 0.01, 0.025, and 0.05. The q̂i,H values from the calibration of the conditional quantiles of the {W2,t = DEM/GBP} series show that θ̃q,T consistently underpredicts tail quantile values, with larger biases at q = 0.025 and q = 0.05. In contrast, for the {W1,t = DEM/USD} series, θ̃q,T performs as well as θ̂q,T, in terms of its empirical q̂ or q̂i,H. The distribution of the DEM/GBP returns has a rather heavy tail with a standardized kurtosis of 18.2. Thus, it may not be a surprise that θ̃q,T underpredicts the tails. However, when the returns are jointly considered in a multivariate fashion, the tails of the DEM/GBP distribution are accurately tracked by θ̂q,T with no statistically significant bias.² In all cases, the bandwidths hi,T (i = 1, 2) are chosen according to the rule-of-thumb (9.22).
2
In finance, it is common to assume normality of returns, although it is well known that one of the stylized facts of many financial time series is that they are heavy-tailed and most often asymmetric. The most usual way of estimating quantile predictions is by first computing conditional variance (volatility) predictions and then making a normality assumption. Obviously, this parametric approach leads to a sizeable underprediction of tail events because in practice returns are not normally distributed. In contrast, the multivariate conditional quantile approach can be computed directly and no distributional assumptions about the process under study are needed.
12.2.1 PolyMARS
PolyMARS, or for short PMARS, is an extension of the MARS procedure (see Sec-
tion 9.2.3) that allows for multiple polychotomous regression; Kooperberg et al.
(1997). The method was introduced primarily to extend the advantages of the
(TS)MARS algorithm over simple recursive partitioning to the multiple classific-
ation problem, in which multinomial response data is considered as a set of 0 –
1 multiple responses. With PMARS, by letting the predictor variables be lagged
values of multivariate time series, one obtains a new method for modeling vec-
tor threshold nonlinear time series with or without additional (lagged) exogenous
predictors. The resulting specification, called vector adaptive spline threshold AR
(eXogenous) (VASTAR(X)) model can be considered as a type of generalized VTAR
model.
Description of PMARS
Let Yt = (Y1,t , . . . , Ym,t ) ∈ Rm be an m-dimensional time series which depends
on q pj -dimensional vectors of time series variables Xj,t = (Xj,t−1 , . . . , Xj,t−pj )
(pj ≥ 0; j = 1, . . . , q). Assume that there are T observations on {Yt } and {Xj,t }
and that the data are presumed to be described by the time series regression model
random variables which are correlated with those from the other regressions, as
specified in (12.10) below.
The goal of semiparametric multivariate regression modeling is to construct a data-driven procedure for simultaneous estimation of the unknown functions μ^(ℓ)(Xt), where Xt = (X′1,t, . . . , X′q,t)′. Specifically, each regression function is modeled as a linear combination of S > 0 basis functions Bs(Xt), so that for a function μ^(ℓ)(·),

  μ^(ℓ)(Xt) = Σ_{s=1}^{S} β^(ℓ)_s Bs(Xt),  (ℓ = 1, . . . , m).   (12.9)
• xi ;
• (xi − τis )+ (xj − τjs )+ if xi (xj − τjs )+ and xj (xi − τis )+ are in the model.
This procedure is a little different from that of (TS)MARS, which constrains the
set of candidate basis functions at each step in a slightly different way. PMARS
thus creates a preference for linear models over nonlinear ones, while interactions
are only considered if they are between predictors that are already in the model.
Further note that PMARS, in contrast to (TS)MARS, does not allow basis functions
of the form (τs − x)+ .
Let X̃ℓ,t = (B1(Xt), . . . , BS(Xt)), and βℓ = (β^(ℓ)_1, . . . , β^(ℓ)_S)′ (ℓ = 1, . . . , m). Then, given a choice of a particular basis for the approximation at (12.9), (12.8) can be placed into vector notation as follows:

  Yt = X̃t β + εt.   (12.10)
Model selection
Analogous to the univariate (TS)MARS methodology, we can use a GCV criterion
for model selection. Given a maximum number M of basis functions (M ≥ S), the criterion is given by

  GCV(M) = [ T^{-1} Σ_{ℓ=1}^{m} Σ_{t=1}^{T} {Yℓ,t − μ̂^(ℓ)_M(Xt)}² ] / {1 − (d × M)/T}²,   (12.11)

where d is a user-specified constant that penalizes larger models. A value of d such that 2 ≤ d ≤ 5 is recommended in practice. The value of M is commonly set equal to min([6T^{1/3}], [T/4], 100).
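In R, (12.11) amounts to the following small helper, assuming resid_mat holds the T × m matrix of residuals from a fit with M basis functions; the function name and the default d = 3 are our own.

  gcv <- function(resid_mat, M, d = 3) {
    Tn <- nrow(resid_mat)
    (sum(resid_mat^2) / Tn) / (1 - d * M / Tn)^2   # (12.11)
  }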
Alternatively, a test data set can be used for model selection by specifying the
test response data, and the test predictor values. Then compute for each fitted
model the residual sum of squared errors (RSS) of the test set. Next, select at
each stage the VASTAR model with the smallest RSS. Fitting a VASTAR model
to all data except a test set of length h and evaluating the model over the test set
corresponds to a leave-out h CV method evaluated only for a single block of series.
Forecasting
Multi-step ahead forecasts for PMARS models can be made using a naive, or plug-in,
iterative approach as a simple extension of (9.54). Specifically, correlations between
the forecast errors of the component variables should be considered in a vector
framework. In that case, the method of model-based block bootstrapping may be
used as an alternative to the “plug-in” method.
where each αi is a p-dimensional vector, and the αi and βℓ,i are chosen using an LS criterion. Each φi(·) is a univariate function of the projection α′iXt, estimated nonparametrically using a kernel-based smoothing method such that E[φi(·)] = 0 and Var[φi(·)] = 1. PPR thus searches for low-dimensional linear projections of a high-dimensional data cloud that can be transformed using nonlinear functions and added together to approximate the structure of {Yt, t ∈ Z}.
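PPR is available in base R via stats::ppr(), which accepts a multi-response matrix. A self-contained toy illustration (simulated stand-in data, not the SST series analyzed below) is:

  set.seed(1)
  X <- matrix(rnorm(500 * 5), 500, 5)              # stand-in lagged predictors
  Y <- cbind(sin(X[, 1]) + 0.1 * rnorm(500),       # stand-in bivariate response
             X[, 2]^2 + 0.1 * rnorm(500))
  fit <- ppr(x = X, y = Y, nterms = 3)             # M = 3 projection terms
  fit$alpha                                        # projection directions alpha_i
  fit$beta                                         # coefficients beta_{l,i}
  head(fit$fitted.values)                          # fitted values, both responses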
Example 12.3: Sea Surface Temperatures (Cont'd)

Recall, in Example 9.7 we showed a TSMARS model fitted to a subset of the transformed daily SSTs at Granite Canyon, i.e. the series {Yt}^{1,825}_{t=1}, with lagged values of Yt, lagged values of wind speed data {WSt}, and lagged values of wind direction data {WDt} as predictors.
Table 12.2: Estimated β̂ℓ,i and α̂i values for the PPR model fitted to the SST and wind speed (WS) data set.

      Predictor weights (α̂i)                           Coefficients
 i    Yt−1    WSt−1    WDt−1    WSt−4    WSt−9          β̂1,i    β̂2,i
The fitted PMARS model suggests that lagged values of WSt have only a minimal effect on transformed SSTs. There is an indication that winds blowing from the North (coded as 2) act to lower SSTs on the following day. Transformed wind speeds are modeled primarily as a function of lagged transformed wind speeds. Wind speeds greater than 2.445 act to increase the wind speed on the following day, as do winds blowing from the North. Taking the inverse transform, the threshold value translates into 10.53 knots, or about 12 mph (5.5 m/sec). The PMARS model explains about 80.5% of the observed variation in SSTs, while explaining only 11.4% of observed variability in wind speeds.

Based on the PMARS model fitting results, we apply PPR with M = 3 using Yt−1, WSt−1, WDt−1, WSt−4, and WSt−9 as predictor variables, giving p = 5. Figure 12.3 shows φ̂i(·) as a function of α̂′iXt. Table 12.2 gives the estimated values of α̂i and β̂ℓ,i.

The α̂1 vector suggests that a combination of lagged wind speeds and lagged wind directions affects the responses. The α̂i (i = 2, 3) vectors give most weight to Yt−1. The φ̂1(·) function is fairly linear, with a slope near 1. The coefficient of φ̂2(·) is 0.077 for the SST response; thus this term corresponds roughly to the term 0.8971Yt−1 in (12.13). The nonlinear nature of φ̂3(·) suggests a nonlinear relation between SSTs and wind speeds and the SST of the previous day. The fitted PPR model explains about 75.2% of the
Figure 12.3: Estimated functional relationships φ̂i(·) (i = 1, 2, 3), plotted against α̂′iXt, for the PPR model fitted to the SST and wind speed (WS) time series.
variance in the SST series, while only 12.5% of the variability in wind speeds
is explained, comparable to the PMARS model results. The wind direction
predictor variable does not play a significant role in the fitted PPR model.
Estimation

The elements of the matrices Φj(·) can be estimated from the observations {(Xt, Yt)}^T_{t=1} using local constant or LL multivariate regression in a neighborhood of Xt with a specified kernel and bandwidth matrix. At time t, denote the AR fit order by p*, and the vector of predictors by Zt; that is, let

  Zt = (1, Y′_{t−1}, . . . , Y′_{t−p*})′,   Yt = Φ(Xt)Zt + εt,  (t = p* + 1, . . . , T).

For the sake of discussion, we temporarily restrict the dimension of the functional variable Xt to q = 1. Since all elements of Φ(·) have continuous second-order derivatives, we may approximate each φ^(j)_{ℓ,k}(·) locally at a point x0 ∈ R by a linear function φ^(j)_{ℓ,k}(x) = a^(j)_{ℓ,k} + b^(j)_{ℓ,k}(x − x0). Partitioning the coefficient matrices in the form (a | b), the LL kernel-based estimator of Φ(·) is defined as Φ̂(x0) = â, where (â | b̂) is the solution to (a | b) that minimizes the weighted sum of squares

  Σ_{t=p*+1}^{T} {Yt − (a | b)(Z′t, U′t)′}′ {Yt − (a | b)(Z′t, U′t)′} Kh(x0 − Xt),   (12.17)

provided U′WU is non-singular.
Forecasting

Forecasting with VFCAR models can be based on, for instance, the naive, or plug-in, method, on MC simulation, and on BS. For ease of discussion, consider the univariate FCAR model (9.61) with Yt = (Yt−1, . . . , Yt−p+1)′. The goal is to find the H-step ahead (H ≥ 1) MMSE forecast of Yt+H, i.e.

  E(Yt+H|Yt) = Σ_{i=1}^{p} E{φi(Yt+H−d) Yt+H−i | Yt},   (12.18)

assuming φi(·) is known. The BS forecast method is, by far, the most commonly used for this purpose. That is, the H-step ahead (H ≥ 2) forecast is given by

  Ŷ^BS_{t+H|t} = B^{-1} Σ_{b=1}^{B} Y^(b)_{t+H|t},   (12.19)

where

  Y^(b)_{t+H|t} = Σ_{i=1}^{p} φ̂i(Yt+H−d|t) Yt+H−i|t + e^(b),   (12.20)
Model assessment

Specific choices for the elements of the matrices Φj(·) in (12.16) can result in parametric vector time series models. This feature is particularly useful, and can be assessed by testing the null hypothesis
with RSSi (i = 0, 1) the matrix residual sum of squares obtained under Hi, given an estimator θ̂ of θ in the specified parametric model Gj(·; θ). Large values of LRT indicate that H0 should be rejected.
Finding the distribution of the test statistic (12.21) in finite samples is a difficult
problem. However, along the same lines as Algorithm 9.6, Harvill and Ray (2006)
propose the following bootstrap procedure.
(iii) Repeat step (ii) B times, to obtain {LR^{*,(b)}_T}^B_{b=1}.

(iv) Compute the bootstrap p-value

  p̂ = {1 + Σ_{b=1}^{B} I(LR^{*,(b)}_T ≥ LRT)} / (1 + B).
Figure 12.4: Estimated AR(1) coefficients for the VFCAR model of SST and wind speed
(WS) as a function of lag one wind speeds (WSt−1 ).
where

  ΣY(ℓ) ≡ Cov(Yt, Yt+ℓ) = Σ_{j=−∞}^{∞} Ψj+ℓ Σε Ψ′j,

with Σε = E(εtε′t), and C2,ε = vec(Σε). Then the (m² × 1) second-order spectral vector, denoted by fY(ω), is related to gY(ω) by the expression

where

  R̂j,k(ωj, ωk) = f̂*_Y(ωj, ωk)′ {ĝY(ωj) ⊗ ĝY(ωk) ⊗ ĝY(−ωj − ωk)}^{-1} f̂Y(ωj, ωk),   (12.31)

with f̂Y(ωj, ωk) the bispectral vector estimator, and L a lattice in the principal domain D defined by (4.7). Then ŜY is asymptotically distributed as χ²_{2m³P} under the null hypothesis of Gaussianity, with P the number of R̂j,k's in D.

Under the null hypothesis of linearity, and as T → ∞, the test statistic ŜY is asymptotically distributed as χ²_{2m³P}(λ̂0), where λ̂0 = P^{-1} Σ_{(j,k)∈L} R̂j,k − 2m³.
Table 12.3: Climate change data set. Indicator pattern of the statistically significant values of the sample ACF, sample PACF, R̂(ℓ), Kendall's τ(ℓ), and Kendall's partial τp(ℓ) test statistics for the δ13C and δ18O time series; T = 216.
In a similar vein, Harvill and Ray (2000) define the multivariate version of the mutual information coefficient (1.20) at lag ℓ. Simulation results indicate that the corresponding sample estimate of Ri,j(ℓ), say R̂i,j(ℓ), identifies appropriate lagged nonlinear bivariate MA terms. Kendall's τ(ℓ) and partial τp(ℓ) test statistics have some power in identifying appropriate lagged nonlinear MA and AR terms, respectively, when the relationship between the lagged variables is monotonic. These test statistics fail when the nonlinear dependence is nonmonotonic, as with bivariate NLMA models.
to δ18O at lags four and five, involving no feedback. For Kendall's τ(ℓ) test statistic, we see significant bi-directional relationships between δ13C and δ18O up to and including lag three. Additionally, values of Kendall's partial τp(ℓ) test statistic are nonsignificant after lag one. In summary, these last three statistics suggest that a first-order NLAR model might be appropriate to model the interdependence between the two climate variables.
We have seen that the nonparametric test statistic R̂(ℓ) can serve as an initial way to infer causal nonlinear relationships. Some subjective interpretation problems, however, exist with this approach. We therefore need some more formal method to investigate causality, and we shall see in the next section how to achieve this.
12.5 Nonparametric Causality Testing

12.5.1 Preamble
Identifying causal relationships among a set of multivariate time series is important in fields ranging from physics to biology to economics. Indeed, based on Granger's (1969) parametric causality test statistic, there exists a large body of literature examining the presence of linear causal linkages between bivariate time series. On the other hand, there is substantially less literature on uncovering nonlinear causal relationships among strictly stationary multivariate time series variables. In this section, we discuss the concept of Granger causality in a more flexible nonparametric setting for both bivariate and multivariate time series processes. However, before doing so, we first introduce the general setting for testing causality.
Assume $\{(X_t, Y_t);\, t\in\mathbb{Z}\}$ is a strictly stationary bivariate time series process. We say that $\{X_t, t\in\mathbb{Z}\}$ is a strict Granger cause of $\{Y_t, t\in\mathbb{Z}\}$ if past and current values of $X_t$ contain additional information on future values of $\{Y_t\}$ that is not contained in the past and current $Y_t$-values alone. More formally, let $\mathcal{F}_{X,t}$ and $\mathcal{F}_{Y,t}$ denote the information sets consisting of past observations of $X_t$ and $Y_t$ up to and including time $t$. Then the process $\{X_t, t\in\mathbb{Z}\}$ is a Granger cause of $\{Y_t, t\in\mathbb{Z}\}$ if, for some $H \ge 1$,
$$(Y_{t+1}, \ldots, Y_{t+H})\,\big|\,(\mathcal{F}_{X,t}, \mathcal{F}_{Y,t}) \;\overset{D}{\nsim}\; (Y_{t+1}, \ldots, Y_{t+H})\,\big|\,\mathcal{F}_{Y,t}, \qquad (12.33)$$
where $\overset{D}{\nsim}$ denotes inequality of the two conditional distributions.
This definition is general and does not involve model assumptions. In practice one often assumes $H = 1$, i.e. testing for (bivariate) Granger non-causality comes down to comparing the one-step ahead conditional distribution of $\{Y_t, t\in\mathbb{Z}\}$ with and without past and current observed values of $\{X_t, t\in\mathbb{Z}\}$. Note, the testing framework introduced above concerns conditional distributions given an infinite number of past observations. In practice, however, tests are usually confined to finite orders in $\{X_t, t\in\mathbb{Z}\}$ and $\{Y_t, t\in\mathbb{Z}\}$. To this end, we define the delay vectors
$$\mathbf{X}_t = (X_t, \ldots, X_{t-\ell_X+1})' \;\text{ and }\; \mathbf{Y}_t = (Y_t, \ldots, Y_{t-\ell_Y+1})', \quad (\ell_X, \ell_Y \ge 1).$$
If past observations of {Xt , t ∈ Z} contain no information about future values, it
follows from (12.33) that the null hypothesis of interest is given by
$$\mathrm{H}_0\colon\; Y_{t+1}\,|\,(\mathbf{X}_t, \mathbf{Y}_t) \;\overset{D}{\sim}\; Y_{t+1}\,|\,\mathbf{Y}_t. \qquad (12.34)$$
For a strictly stationary bivariate time series, (12.34) comes down to a statement about the invariant distribution of the $d_W = (\ell_X + \ell_Y + 1)$-dimensional vector $\mathbf{W}_t = (\mathbf{X}_t', \mathbf{Y}_t', Z_t)'$, where $Z_t = Y_{t+1}$. To simplify notation, we drop the time index $t$, and just write $\mathbf{W} = (\mathbf{X}', \mathbf{Y}', Z)'$.
Under H$_0$, the conditional distribution of $Z$ given $(\mathbf{X}', \mathbf{Y}')' = (\mathbf{x}', \mathbf{y}')'$ is the same as that of $Z$ given $\mathbf{Y} = \mathbf{y}$. Then (12.34) can be restated in terms of ratios of joint distributions. Specifically, the joint pdf $f_{X,Y,Z}(\mathbf{x},\mathbf{y},z)$ and its marginals must satisfy the relationship
$$\frac{f_{X,Y,Z}(\mathbf{x},\mathbf{y},z)}{f_{X,Y}(\mathbf{x},\mathbf{y})} = \frac{f_{Y,Z}(\mathbf{y},z)}{f_Y(\mathbf{y})},$$
or equivalently
$$\frac{f_{X,Y,Z}(\mathbf{x},\mathbf{y},z)}{f_Y(\mathbf{y})} = \frac{f_{X,Y}(\mathbf{x},\mathbf{y})}{f_Y(\mathbf{y})}\,\frac{f_{Y,Z}(\mathbf{y},z)}{f_Y(\mathbf{y})}, \qquad (12.35)$$
for each vector $(\mathbf{x}', \mathbf{y}', z)'$ in the support of $\mathbf{W}$.
In line with (12.35), the Hiemstra–Jones (HJ) test statistic compares ratios of correlation integrals,
$$Q_{T,\mathbf{W}}(h) = \frac{\widehat{C}_{X,Y,Z}(h)}{\widehat{C}_{X,Y}(h)} - \frac{\widehat{C}_{Y,Z}(h)}{\widehat{C}_{Y}(h)},$$
where, for a generic vector $\mathbf{W}$, the sample correlation integral is given by
$$\widehat{C}_{\mathbf{W}}(h) = \binom{T}{2}^{-1}\sum_{1\le i<j\le T} I\big(\|\mathbf{W}_i - \mathbf{W}_j\| < h\big).$$
Since the correlation integral is a U-statistic (Appendix 7.C), it can be shown (Hiemstra and Jones, 1994, Appendix) that, under H$_0$,
$$\sqrt{T}\,Q_{T,\mathbf{W}}(h) \overset{D}{\longrightarrow} N\big(0, \sigma^2_{\mathbf{W}}(h)\big), \;\text{ as } T\to\infty, \qquad (12.38)$$
where $\sigma^2_{\mathbf{W}}(h)$ is a lengthy expression, not given here. An autocorrelation consistent estimator of $\sigma^2_{\mathbf{W}}(h)$ follows from using the theory of Newey and West (1987). In practice, it is recommended to use one-sided critical values of $Q_{T,\mathbf{W}}(h)$. Bai et al. (2010) extend the HJ test statistic to the multivariate case.
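A compact R sketch of the sample correlation integral and the resulting HJ-type statistic, for lag-one delay vectors ($\ell_X = \ell_Y = 1$) and the sup-norm, is as follows; it is an illustration of the definitions above, not the hj.r code accompanying the book.

## Sample correlation integral C_W(h) over all pairs i < j (sup-norm),
## and the HJ statistic Q_{T,W}(h) for W_t = (X_t, Y_t, Z_t), Z_t = Y_{t+1}.
corr_integral <- function(M, h) {
  D <- as.matrix(dist(M, method = "maximum"))
  mean(D[upper.tri(D)] < h)
}
hj_stat <- function(x, y, h) {
  T1 <- length(y) - 1
  W  <- cbind(X = x[1:T1], Y = y[1:T1], Z = y[2:(T1 + 1)])
  corr_integral(W[, c("X", "Y", "Z")], h) / corr_integral(W[, c("X", "Y")], h) -
    corr_integral(W[, c("Y", "Z")], h)  / corr_integral(W[, "Y", drop = FALSE], h)
}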
The Diks–Panchenko (DP) test statistic replaces the ratios of correlation integrals in $Q_{T,\mathbf{W}}(h)$ by a weighted average of local density-based contributions,
$$Q^*_{T,\mathbf{W}}(h) = \frac{T-1}{T(T-2)}\sum_{i}\big\{\widehat{f}_{X,Y,Z}(\mathbf{X}_i,\mathbf{Y}_i,Z_i)\,\widehat{f}_{Y}(\mathbf{Y}_i) - \widehat{f}_{X,Y}(\mathbf{X}_i,\mathbf{Y}_i)\,\widehat{f}_{Y,Z}(\mathbf{Y}_i,Z_i)\big\}. \qquad (12.41)$$
For an appropriate sequence of bandwidths, the estimator $\widehat{f}_{\mathbf{W}}(\cdot)$ of the pdf $f_{\mathbf{W}}(\cdot)$ is consistent. So, $Q^*_{T,\mathbf{W}}(h)$ consists of a weighted average of local contributions given by the expression in curly brackets, which tends to zero in probability under H$_0$. The test statistic (12.41) can be rearranged in terms of a U-statistic as follows,
$$Q^*_{T,\mathbf{W}}(h) = \frac{1}{T(T-1)(T-2)}\sum_{i\neq j\neq k\neq i} K(\mathbf{W}_i, \mathbf{W}_j, \mathbf{W}_k), \qquad (12.42)$$
where
$$K(\mathbf{W}_i,\mathbf{W}_j,\mathbf{W}_k) = \frac{(2h)^{-d_X-2d_Y-d_Z}}{3!}\Big[\big(I^{(XYZ)}_{ik}I^{(Y)}_{ij} - I^{(XY)}_{ik}I^{(YZ)}_{ij}\big) + \big(I^{(XYZ)}_{ij}I^{(Y)}_{ik} - I^{(XY)}_{ij}I^{(YZ)}_{ik}\big) + \big(I^{(XYZ)}_{jk}I^{(Y)}_{ji} - I^{(XY)}_{jk}I^{(YZ)}_{ji}\big) + \big(I^{(XYZ)}_{ji}I^{(Y)}_{jk} - I^{(XY)}_{ji}I^{(YZ)}_{jk}\big) + \big(I^{(XYZ)}_{ki}I^{(Y)}_{kj} - I^{(XY)}_{ki}I^{(YZ)}_{kj}\big) + \big(I^{(XYZ)}_{kj}I^{(Y)}_{ki} - I^{(XY)}_{kj}I^{(YZ)}_{ki}\big)\Big],$$
where, e.g., $I^{(XYZ)}_{ij}$ denotes the indicator that the $(\mathbf{X},\mathbf{Y},Z)$-components of $\mathbf{W}_i$ and $\mathbf{W}_j$ are within distance $h$ of each other.
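The U-statistic form translates directly into R; the sketch below computes $Q^*_{T,\mathbf{W}}(h)$ for the lag-one case ($d_X = d_Y = d_Z = 1$) via the equivalent local density representation (12.41), so that under H$_0$ its value should be close to zero. It is a minimal illustration, not the dp.r code accompanying the book.

## Q*_{T,W}(h) for lag-one delay vectors: W_t = (X_t, Y_t, Z_t), Z_t = Y_{t+1}.
dp_stat <- function(x, y, h) {
  T1 <- length(y) - 1
  X <- x[1:T1]; Y <- y[1:T1]; Z <- y[2:(T1 + 1)]
  near <- function(v) abs(outer(v, v, "-")) < h    # pairwise indicator matrices
  Ix <- near(X); Iy <- near(Y); Iz <- near(Z)
  Ixyz <- Ix & Iy & Iz; Ixy <- Ix & Iy; Iyz <- Iy & Iz
  diag(Ixyz) <- diag(Ixy) <- diag(Iyz) <- diag(Iy) <- FALSE
  s <- sum(rowSums(Ixyz) * rowSums(Iy) - rowSums(Ixy) * rowSums(Iyz))
  (2 * h)^(-4) * s / (T1 * (T1 - 1) * (T1 - 2))    # (2h)^{-dX-2dY-dZ} = (2h)^{-4}
}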
Moreover, the asymptotic variance of the test statistic is given by
$$\sigma^2_{\mathbf{W}}(h) = 9\,\mathrm{Var}\big\{r_0(\mathbf{W}_i)\big\}, \quad \text{with } r_0(\mathbf{w}) = \lim_{h\to 0} E\big\{K(\mathbf{w}, \mathbf{W}_2, \mathbf{W}_3)\big\},$$
and an autocorrelation consistent estimator of $\sigma^2_{\mathbf{W}}(h)$ can be based on the sample autocovariances
$$\widehat{\gamma}_{\mathbf{W}}(\ell) = \frac{1}{T-\ell}\sum_{i=1}^{T-\ell}\big\{\widehat{r}_0(\mathbf{W}_i) - \overline{Q}_T\big\}\big\{\widehat{r}_0(\mathbf{W}_{i+\ell}) - \overline{Q}_T\big\},$$
where $\overline{Q}_T$ denotes the sample average of the $\widehat{r}_0(\mathbf{W}_i)$'s.
Here the bandwidth follows the rate $\beta = 2/7$, which implies a bandwidth of approximately 1.5, with $C = 7$ and $T = 216$. The bias of the HJ test statistic $Q_{T,\mathbf{W}}(h)$ cannot be removed by choosing a bandwidth smaller than 1.5.
Figure 12.5: Extended climate change data set. Nonparametric causality testing at lags $\ell_{Y_1} = \ell_{Y_2} = 1, \ldots, 5$, with $h = 1.5$; (a) $Q_{T,\mathbf{W}}(h)$ test statistic and (b) $Q^*_{T,\mathbf{W}}(h)$ test statistic. The single arrow symbol marks a p-value in the range 1% – 5%, and the double arrow symbol marks a p-value smaller than 1%; $T = 216$.
In a multivariate setting, a pairwise test may be distorted by the confounding effect of other variables. One simple way to control for these additional variables is to pre-filter the multivariate data by a parametric model (e.g. a linear VAR model), and next perform a bivariate causality test on the residuals pairwise. As an alternative, Diks and Wolski (2016) generalize the bivariate test statistic $Q^*_{T,\mathbf{W}}(h)$ to a multivariate setting. Following these authors, we first state a generalization of (12.33).
Consider the strictly stationary multivariate time series process {(Xt , Yt , Qt ), t ∈
Z}, where {Xt , t ∈ Z} and {Yt , t ∈ Z} are univariate time series processes, and
{Qt , t ∈ Z} is a univariate or multivariate time series process. Then the process
{Xt , t ∈ Z} is a Granger cause of {Yt , t ∈ Z} if, for some H ≥ 1,
$$(Y_{t+1}, \ldots, Y_{t+H})\,\big|\,(\mathcal{F}_{X,t}, \mathcal{F}_{Y,t}, \mathcal{F}_{Q,t}) \;\overset{D}{\nsim}\; (Y_{t+1}, \ldots, Y_{t+H})\,\big|\,(\mathcal{F}_{Y,t}, \mathcal{F}_{Q,t}), \qquad (12.45)$$
where $\mathcal{F}_{X,t}$, $\mathcal{F}_{Y,t}$, and $\mathcal{F}_{Q,t}$ are the corresponding information sets. Note, the assumption that both $\{X_t, t\in\mathbb{Z}\}$ and $\{Y_t, t\in\mathbb{Z}\}$ are scalar-valued time series processes makes it possible to determine whether the causal relationship between these two processes is direct or mediated by other variables.
Now, consider the same setup as in Section 12.5.1 with the delay vectors $\mathbf{X}_t$, $\mathbf{Y}_t$, and $\mathbf{Q}_t = (Q_t, \ldots, Q_{t-\ell_Q+1})'$. So, the multivariate analogue of the null hypothesis (12.34) is given by
$$\mathrm{H}_0\colon\; Y_{t+1}\,|\,(\mathbf{X}_t, \mathbf{Y}_t, \mathbf{Q}_t) \;\overset{D}{\sim}\; Y_{t+1}\,|\,(\mathbf{Y}_t, \mathbf{Q}_t). \qquad (12.46)$$
For simplicity, assume that the embedding dimensions are all equal to unity, i.e. $\ell_X = \ell_Y = \ell_Q = 1$. Thus, the dimensionality of the vector $\mathbf{W}_t = (X_t, Y_t, \mathbf{Q}_t', Z_t)'$, where $Z_t = Y_{t+1}$, is a number $d_W \ge 4$. In this case, and following the same reasoning as in Section 12.5.3, the asymptotic normality condition becomes $1/(2\nu) < \beta < 1/d_W$. So, for a standard second-order kernel ($\nu = 2$) and $d_W \ge 4$, there is no feasible $\beta$-region which would endow the test statistic $Q^*_{T,\mathbf{W}}(h)$ with asymptotic normality. The associated problem is the well-known curse of dimensionality.
One solution, followed by Diks and Wolski (2016), is to improve the precision of the density estimator by using data sharpening (Hall and Minnotte, 2002) as a bias reduction method. The sharpened (s) form of the plug-in density estimator is given by
$$\widehat{f}^{\,s}_{\mathbf{W}}(\mathbf{W}_i) = \frac{h^{-d_W}}{T-1}\sum_{j,\,j\ne i} K\Big(\frac{\mathbf{W}_i - \psi_p(\mathbf{W}_j)}{h}\Big),$$
where $\psi_p(\cdot)$ is a so-called sharpening function, with $p$ the order of bias reduction. On replacing the data by their sharpened form in the definition of the kernel density estimator $\widehat{f}_{\mathbf{W}}(\cdot)$ one obtains an estimator of $f_{\mathbf{W}}(\cdot)$ whose bias equals $O(h^4)$, with $p \equiv d_W = 4$, rather than $O(h^2)$ (Hall and Minnotte, 2002) for $\widehat{f}_{\mathbf{W}}(\cdot)$. In this case the sharpening function is of the form
$$\psi_4(\mathbf{W}) = \mathbf{W} + h^2\,\frac{\mu_2(K)}{2}\,\frac{\nabla\widehat{f}(\mathbf{W})}{\widehat{f}(\mathbf{W})},$$
where $\mu_2(K) = \int_{\mathbb{R}} u^2K(u)\,du$, and $\nabla\widehat{f}$ is the estimator of the gradient of $f$. In practice, the NW kernel estimator may be used as an approximation for the ratio $\nabla\widehat{f}(\mathbf{W})/\widehat{f}(\mathbf{W})$. Clearly, the lower order of the bias makes it possible to find a range of feasible $\beta$-values again, in this case $\beta \in \big(1/(2p), 1/d_W\big) = \big(1/8, 1/4\big)$.
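A univariate toy version of the sharpening step makes the mechanics visible; the R sketch below assumes a Gaussian kernel (so $\mu_2(K) = 1$) and approximates the ratio $\nabla\widehat{f}/\widehat{f}$ by kernel estimates, in the NW spirit mentioned above.

## Order-4 data sharpening psi_4 for a univariate sample w (a sketch);
## each point is shifted by h^2/2 times an estimate of f'(w)/f(w).
sharpen <- function(w, h) {
  ratio <- sapply(w, function(x) {
    u <- (x - w) / h
    k <- dnorm(u)
    sum(-u / h * k) / sum(k)     # kernel estimate of f'(x) / f(x)
  })
  w + h^2 / 2 * ratio
}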
The sharpened form of the test statistic, $Q^{s}_{T,\mathbf{W}}(h)$, is obtained on replacing the density estimators entering $Q^*_{T,\mathbf{W}}(h)$ by their sharpened counterparts $\widehat{f}^{\,s}(\cdot)$. Under certain mixing conditions Diks and Wolski (2016, Appendix B) show that, as $T\to\infty$,
$$\sqrt{T}\,\frac{Q^{s}_{T,\mathbf{W}}(h) - \Delta}{S_T} \;\overset{D}{\longrightarrow}\; N(0,1), \;\text{ iff }\; \frac{1}{2p} < \beta < \frac{1}{d_W}. \qquad (12.48)$$
The following algorithm summarizes a regressor-selection procedure for truncated Volterra expansions; a short R sketch follows the list.

(i) For a given sample size $T$, select the polynomial order $\ell$ in the truncated Volterra representation for $\{Y_t\}_{t=1}^{T}$. A larger $\ell$ is necessary for larger $T$.
(ii) Regress $\{Y_t\}$ on all variables (lagged values of $\{Y_t\}$, any exogenous variables, and products up to order $\ell$ of all lagged values and exogenous variables) and compute the value of an appropriate model selection criterion, such as AIC or BIC.
(iii) Omit one regressor from the original model, regress the time series $\{Y_t\}$ on all remaining variables in the $\ell$th order Taylor series expansion, and compute the value of the selection criterion.
(iv) Repeat, omitting one regressor each time. Continue, omitting two regressors at a time, etc., until the regression consists of only a constant term (all regressors removed, corresponding to $\{Y_t, t\in\mathbb{Z}\}$ being WN).
(v) The combination of regressors resulting in the optimal model selection criterion value is
selected.
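As a practical stand-in for the exhaustive search in steps (iii) – (v), the R sketch below uses greedy backward elimination by AIC on a second-order ($\ell = 2$) Volterra expansion with $p$ lags and no exogenous variables; all names are illustrative.

## Steps (i)-(v) in miniature: build the l = 2 Volterra regressors and let
## step() drop them one at a time by AIC (greedy, not exhaustive).
volterra_select <- function(y, p = 3) {
  emb <- embed(y, p + 1)                      # row t: (Y_t, Y_{t-1}, ..., Y_{t-p})
  dat <- data.frame(emb)
  names(dat) <- c("Yt", paste0("lag", 1:p))
  for (i in 1:p) for (j in i:p)               # second-order product terms
    dat[[paste0("lag", i, "x", j)]] <-
      dat[[paste0("lag", i)]] * dat[[paste0("lag", j)]]
  full <- lm(Yt ~ ., data = dat)              # step (ii): the full regression
  step(full, direction = "backward", trace = 0)
}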
Section 12.5: By exploiting the geometry of reproducing kernel Hilbert spaces, Marinazzo
et al. (2008) develop a nonlinear Granger causality test statistic for bivariate time series.
Gao and Tian (2009) consider the construction of Granger causality graphs for multivariate
nonlinear time series. Péguin–Feissolle et al. (2013) propose two test statistics for bivariate
Granger non-causality in a stationary nonlinear model of unknown functional form. The
idea is to globally approximate the potential causal relationship between the variables by
a Taylor series expansion. A few applications of the test statistics in Section 12.5.3 have
been reported. For instance, Bekiros and Diks (2008) investigate linear and nonlinear causal
linkages among six currencies. De Gooijer and Sivarajasingham (2008) apply both paramet-
ric and nonparametric Granger causality tests to determine linkages between international
stock markets. Francis et al. (2010) use both linear and nonlinear causality tests to examine
the relationship between the returns on large and small firms.
Software References
Section 12.2.1: PolyMARS (or PMARS) is available in the R-polspline package. The R-
fRegression package has an option for computing a PMARS model as a part of the function
regFit; see also the references to software packages in Section 9.5.
Section 12.2.2: The function ppr in the R-stats package, and the function ppreg in S-Plus both allow for PPR model fitting with multivariate responses.
Section 12.5: R codes for performing the HJ (hj.r) and the Diks–Panchenko (dp.r) nonpara-
metric test statistics are available at the website of this book. The C source code, and an
executable file, for computing both test statistics can be downloaded from http://www1.
fee.uva.nl/cendef/upload/6/hjt2.zip. Alternatively, a Windows version and C
source code are available at http://research.economics.unsw.edu.au/vpanchenko/
#software. C source code for the multivariate nonlinear nonparametric Granger causality
test is available at http://qed.econ.queensu.ca/jae/datasets/diks001/.
Appendix

12.A Computing Conditional Quantiles

The multivariate kernel-based conditional quantile estimator can be written as
$$\widehat{\theta}_{q,T}(\mathbf{x}) = \arg\min_{\theta\in\mathbb{R}^m}\sum_{t=1}^{T}\|\mathbf{Y}_t - \theta\|_q\,K_h(\mathbf{x} - \mathbf{X}_t) = \arg\min_{\theta\in\mathbb{R}^m}\sum_{t=1}^{T}\big(\|\mathbf{Y}_t - \theta\|_q\big)^2\,G_q(\mathbf{x}, \mathbf{X}_t, \mathbf{Y}_t; \theta, h), \qquad (\mathrm{A.1})$$
where
$$G_q(\mathbf{x}, \mathbf{X}_t, \mathbf{Y}_t; \theta, h) = \frac{K_h(\mathbf{x} - \mathbf{X}_t)}{\|\mathbf{Y}_t - \theta\|_q}.$$
We now follow an iterative approach to solve (A.1). Let $\theta^{(1)}_{q,T}(\mathbf{x}), \ldots, \theta^{(r)}_{q,T}(\mathbf{x})$ be successive approximations of $\widehat{\theta}_{q,T}(\mathbf{x})$ obtained in consecutive iterations. Let $\mathbf{1} = (1, \ldots, 1)$ denote a unity row vector with dimension $m$. First, we define the $T\times m$ matrix $\mathbf{W}_q(\cdot)$ as a direct (or Hadamard) product ($\odot$) of two $T\times m$ matrices, i.e. $\mathbf{W}_q(\mathbf{Y}, \mathbf{x}, \mathbf{X}; \theta, h) = \mathbf{M}_q(\mathbf{Y}; \theta) \odot \{G_q(\mathbf{x}, \mathbf{X}, \mathbf{Y}; \theta, h)\,\mathbf{1}\}$, where
$$\mathbf{M}_q(\mathbf{Y};\theta) = (0.5)^2\begin{pmatrix} \{\mathrm{sign}(Y_{1,1}-\theta_1)+(2q-1)\}^2 & \cdots & \{\mathrm{sign}(Y_{m,1}-\theta_m)+(2q-1)\}^2\\ \{\mathrm{sign}(Y_{1,2}-\theta_1)+(2q-1)\}^2 & \cdots & \{\mathrm{sign}(Y_{m,2}-\theta_m)+(2q-1)\}^2\\ \vdots & & \vdots\\ \{\mathrm{sign}(Y_{1,T}-\theta_1)+(2q-1)\}^2 & \cdots & \{\mathrm{sign}(Y_{m,T}-\theta_m)+(2q-1)\}^2 \end{pmatrix}.$$
The vector $\mathbf{1}$ is used to resize the vector $G_q(\cdot)$ into a $T\times m$ matrix. Then, at iteration step $(r+1)$, $\theta^{(r+1)}_{q,T}(\mathbf{x})$ is simply computed by
$$\theta^{(r+1)}_{q,T}(\mathbf{x}) = \frac{\sum\big\{\mathbf{Y}\odot \mathbf{W}_q(\mathbf{Y},\mathbf{x},\mathbf{X};\theta^{(r)}_{q,T},h)\big\}}{\sum\big\{\mathbf{W}_q(\mathbf{Y},\mathbf{x},\mathbf{X};\theta^{(r)}_{q,T},h)\big\}}. \qquad (\mathrm{A.2})$$
The sum in the above formula refers to the sum for each column, and the division is a direct (elementwise) division. Equation (A.2) shows that once $\theta^{(r)}_{q,T}$ is given, the solution to (A.1) at iteration step $r+1$ simply follows from applying weighted least squares.
The iteration is continued until two successive approximations of $\widehat{\theta}_{q,T}(\mathbf{x})$ are sufficiently close. For the numerical illustration in this chapter, convergence is assumed if $\|\theta^{(r+1)}_{q,T}(\mathbf{x}) - \theta^{(r)}_{q,T}(\mathbf{x})\|_2 \le 10^{-3}\,\|\mathbf{Y} - \mathbf{1}\times\theta^{(1)}_{q,T}(\mathbf{x})\|_2$. The above algorithm is fully vectorized so that it can be easily implemented in matrix-oriented software packages like GAUSS or MATLAB (see, e.g., the file illustrate.m).
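For $m = 1$ the algorithm collapses to a scalar iteratively reweighted average; the R sketch below starts from the $q = 0.5$ (conditional mean) value, in line with the initialization strategy discussed next, and uses a simple relative convergence rule as a stand-in for the criterion above.

## Kernel-weighted conditional q-quantile at x0 via the IRLS update (A.2),
## univariate case (m = 1); Gaussian kernel K_h, a minimal sketch.
cond_quantile <- function(y, xreg, x0, q, h, tol = 1e-3, max_iter = 100) {
  kh    <- dnorm((x0 - xreg) / h)               # local kernel weights K_h
  theta <- weighted.mean(y, kh)                 # start from the conditional mean
  for (r in 1:max_iter) {
    u    <- y - theta
    loss <- 0.5 * (abs(u) + (2 * q - 1) * u)    # check-type loss ||u||_q
    w    <- 0.25 * (sign(u) + (2 * q - 1))^2 * kh / pmax(loss, 1e-8)
    theta_new <- sum(w * y) / sum(w)            # weighted least-squares update
    if (abs(theta_new - theta) < tol * sd(y)) { theta <- theta_new; break }
    theta <- theta_new
  }
  theta
}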
It is worth noting that the algorithm requires a good initial approximation of θq,T (x) to
start the iteration. We suggest the following approach. When q = 0.5, the conditional mean
can be taken as the starting value. For q > 0.5 or q < 0.5, one may start from the optimal
value for q = 0.5 and move upward or downward. For example, to estimate the conditional
quantile at q = 0.9, one may first estimate this quantity for q = 0.6 starting from q = 0.5.
Then estimate the conditional quantile for q = 0.7 starting from q = 0.6 and so on until the
end. In doing so, convergence to a local optimum is facilitated.
Finally, it is interesting to mention that the proposed estimator is more efficient in
the sense that it requires less computing time than the corresponding univariate estimator.
This is the case even for dimension m as high as 7 or 8. This empirical evidence may
suggest that the fast converging property of the unconditional multivariate quantiles (see,
e.g., Chaudhuri, 1996) may also be shared by the conditional estimator defined above.
12.B Percentiles of the $R(\ell)$ Test Statistic
Following Harvill and Ray (2000, Section 2.2), we estimate the marginal densities by smoothing the standardized data with a (scaled) second-order Student $t_\nu$ kernel-based density, as given by
$$K(u) = \frac{\Gamma\{(\nu+1)/2\}\big/\{\sqrt{\pi\nu}\,\Gamma(\nu/2)\}}{h\{1 + u^2/(\nu h^2)\}^{(\nu+1)/2}}, \qquad (\mathrm{B.1})$$
with $\nu = 4$ degrees of freedom, and adopting a bandwidth $h = 0.85T^{-1/5}$. We estimate the bivariate density of the pair of random variables $(X, Y)$ by a product kernel of Student $t_4$ distributions, with a bandwidth that depends on the correlation coefficient $\rho_{X,Y}$. Apart from the factor 0.85, this particular bandwidth follows from minimizing the AMISE using a bivariate Gaussian kernel; see Scott (1992, Section 6.3.1). The choice for the Student $t_4$ kernel is motivated by the work of Hall and Morton (1993). No boundary correction is needed in either kernel-based density computation, since the Student $t$ distribution has infinite support. In addition, we estimate the integrals in (1.18) numerically using a 30-point Gaussian quadrature. The limits of the integration are chosen conservatively, as the minimum and maximum of the observed data.
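The marginal estimator is easily reproduced; the R sketch below codes the scaled $t_4$ kernel (B.1) and the corresponding density estimate with the stated bandwidth.

## Scaled Student-t kernel (B.1) with nu = 4, and the marginal density
## estimate f_hat(x) = T^{-1} sum_i K(x - X_i) with h = 0.85 * T^(-1/5).
t4_kernel <- function(u, h, nu = 4) {
  gamma((nu + 1) / 2) / (sqrt(pi * nu) * gamma(nu / 2)) /
    (h * (1 + u^2 / (nu * h^2))^((nu + 1) / 2))
}
t4_density <- function(x, data) {
  h <- 0.85 * length(data)^(-1/5)
  sapply(x, function(x0) mean(t4_kernel(x0 - data, h)))
}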
Table 12.4 shows the empirical mean, standard deviation, and 90%, 95%, and 99% percentile points of the $R(\ell)$ test statistic for various sample sizes $T$, and lags $\ell$, using 1,000 MC replications. The results for $T = 300$ are in agreement with percentiles reported by Harvill and Ray (2000, Table I). It is clear that $R(\ell)$ is biased in finite samples. As expected, the bias decreases as $T$ increases. Joe (1989) and Hall and Morton (1993) show that a summation-based estimator of the Shannon entropy $H(X) = -\int \log\{f_X(\mathbf{x})\}f_X(\mathbf{x})\,d\mathbf{x}$ of an $m$-dimensional random variable $X$, and thus of $R(\ell)$, is root-$n$ consistent in $m = 1, 2$, and 3 dimensions. This result requires certain properties of the tails of the underlying distribution.
Table 12.4: Empirical mean, standard deviation, and percentile points of the $R(\ell)$ test statistic for dimension $m = 2$, various sample sizes $T$, and lags $\ell$; 1,000 MC replications.
Lag 90% 95% 99% Mean Std.dev 90% 95% 99% Mean Std.dev
T = 100 T = 200
1 0.2536 0.2604 0.2755 0.2282 0.0200 0.2214 0.2266 0.2370 0.2033 0.0139
2 0.2535 0.2620 0.2768 0.2287 0.0197 0.2225 0.2278 0.2361 0.2037 0.0142
3 0.2542 0.2636 0.2757 0.2290 0.0199 0.2228 0.2286 0.2388 0.2042 0.0144
4 0.2554 0.2629 0.2810 0.2298 0.0204 0.2241 0.2296 0.2385 0.2049 0.0144
5 0.2554 0.2616 0.2757 0.2294 0.0202 0.2233 0.2303 0.2421 0.2046 0.0145
T = 300 T = 400
1 0.2059 0.2101 0.2170 0.1894 0.0122 0.1925 0.1984 0.2074 0.1791 0.0109
2 0.2032 0.2082 0.2178 0.1885 0.0119 0.1928 0.1971 0.2042 0.1792 0.0107
3 0.2046 0.2088 0.2165 0.1891 0.0119 0.1936 0.1970 0.2067 0.1794 0.0108
4 0.2059 0.2102 0.2211 0.1901 0.0120 0.1934 0.1974 0.2046 0.1797 0.0108
5 0.2049 0.2101 0.2173 0.1900 0.0118 0.1935 0.1980 0.2042 0.1801 0.0103
T = 500 T = 1,000
1 0.1850 0.1888 0.1948 0.1717 0.0098 0.1888 0.1976 0.2059 0.1652 0.0196
2 0.1850 0.1881 0.1966 0.1712 0.0101 0.1900 0.1955 0.2033 0.1659 0.0201
3 0.1841 0.1881 0.1950 0.1717 0.0099 0.1873 0.1962 0.2071 0.1657 0.0188
4 0.1841 0.1880 0.1948 0.1719 0.0097 0.1880 0.1972 0.2066 0.1648 0.0193
5 0.1845 0.1883 0.1964 0.1717 0.0096 0.1851 0.1977 0.2069 0.1653 0.0193
Exercises
Theory Question
12.1 Consider the well-known property of the Kronecker product (A ⊗ B)(C ⊗ D) = AC ⊗
BD, if AC and BD exist. Using this property, verify (12.26).
(a) Compute the sample ACF and PACF matrices for lags $\ell = 1, \ldots, 5$. Discuss the overall pattern of these statistics. Verify your observations with those made in Example 11.5.
(b) Using the MATLAB code Rtest.m, compute the values of the $R(\ell)$ test statistic for $\ell = 1, \ldots, 5$. Determine the appropriate lags for inclusion in a vector NLAR model.
[Note: For $T = 66$, the 5% critical values of the $R(\ell)$ test statistic are given by 0.317 ($\ell = 1$), 0.315 ($\ell = 2$), 0.325 ($\ell = 3$), 0.315 ($\ell = 4$), and 0.326 ($\ell = 5$).]
12.3 The files earthP1.dat – earthP4.dat accompany the climate change data set of Example 1.5, but now covering each of the four climatic periods P1 – P4. Each file consists of four time series variables: $\delta^{13}$C, $\delta^{18}$O, dust flux, and insolation.
(a) Test for the presence of a nonlinear causal pairwise relationship between the four
series (all re-scaled) in time periods P4, P3, and P2, using the modified bivariate
nonparametric test statistic Q∗T,W (h) with bandwidth h = 1.5 (denoted by the
variable “epsilon” in the C and R codes). Use nominal significance levels of 1%
and 5% in all pairwise tests.
(b) Compare and contrast the test results in part (a) with those reported in Example
12.5 for time period P1.
12.4 Consider the Icelandic river flow data set introduced in Section 11.8. The dependent variables are the daily river flow, measured in m$^3$/s, of the Jökulsá Eystri river ($Q_{1,t}$) and the Vatnsdalsá river ($Q_{2,t}$), i.e. 1,095 observations for analysis. The exogenous variables used in the model specification are lagged values of streamflow ($Q_{1,t-\ell}$, $Q_{2,t-\ell}$) ($\ell = 1, \ldots, 20$), lagged values of precipitation ($P_{t-1}$, $P_{t-2}$, $P_{t-3}$), and contemporaneous and lagged values of temperature ($T_t$, $T_{t-1}$).
(a) Fit two PMARS models to the data: an unrestricted VARX model, and a restricted (additive) VARX model. Use the GCV criterion for model selection with default value $d = 4$. Find the unrestricted model with the lowest value of $|\widehat{\Sigma}_\varepsilon|$, i.e. the determinant of the residual covariance matrix.
[Hint: Use the function polymars in the R-polspline package.]
(b) In part (a) you will notice that the "best" fitted unrestricted PMARS–VARX model is attained at lag $\ell = 15$. Compare the determinant of the residual covariance matrix of this particular model with the determinant of the pooled residual covariance matrix computed from $\widehat{\Sigma}^{(1)}_\varepsilon$ and $\widehat{\Sigma}^{(2)}_\varepsilon$ given in Table 11.4 for the VTARX model.
(c) Given the unrestricted PMARS–VARX model in part (b), consider only terms
with absolute coefficient value more than twice the estimated standard error.
Compare the resulting model with the nonlinear time series models presented in
Exercise 2.11 and Table 11.4.
(d) Test for the presence of a nonlinear causal relationship between the series $\{Q_{1,t}\}$ and $\{Q_{2,t}\}$, using the modified bivariate nonparametric test statistic $Q^*_{T,\mathbf{W}}(h)$ with $h = 1.5$ and embedding dimensions $\ell_{Q_1} = \ell_{Q_2} = 1, \ldots, 8$.
References∗
Aase, K.K. (1983). Recursive estimation in non-linear time series models of autoregressive
type. Journal of the Royal Statistical Society, B 45(2), 228–237. [248]
Abdous, B. and Theodorescu, R. (1992). Note on the spatial quantile of a random vector.
Statistics & Probability Letters, 13(4), 333–336.
DOI: 10.1016/0167-7152(92)90043-5. [496, 521]
Adhikari, R. (2015). A neural network based linear ensemble framework for time series
forecasting. Neurocomputing, 157(1), 231–242. DOI: 10.1016/j.neucom.2015.01.012. [430]
Aiolfi, M., Capistrán, C., and Timmermann, A. (2011). Forecast combinations. In M.P.
Clements and D.F. Hendry (Eds.), The Oxford Handbook of Economic Forecasting, Oxford
University Press, Oxford, UK, pp. 355–388.
DOI: 10.1093/oxfordhb/9780195398649.013.0013. [425]
Akamanam, S.I., Bhaskara Rao, M., and Subramanyam, K. (1986). On the ergodicity of
bilinear time series models. Journal of Time Series Analysis, 7(3), 157–163.
DOI: 10.1111/j.1467-9892.1986.tb00499.x. [110]
Aldous, D. (1989). Probability Approximation via the Poisson Clumping Heuristic. Applied
Mathematical Sciences 77, Springer-Verlag, New York. (Freely available at: http://en.
booksee.org/book/1304840). [170]
∗
A DOI (Digital Object Identifier) number can be converted to a web address with the URL
prefix http://dx.doi.org/. The URL will lead to the abstract of a paper or a book.
Al-Qassem, M.S. and Lane, J.A. (1989). Forecasting exponential autoregressive models of
order 1. Journal of Time Series Analysis, 10(2), 95–113.
DOI: 10.1111/j.1467-9892.1989.tb00018.x. [393, 401, 405, 428]
Alquist, R. and Kilian, L. (2010). What do we learn from the price of crude oil futures?
Journal of Applied Econometrics, 25(4), 539–573. DOI: 10.1002/jae.1159. [431]
Amendola, A. and Francq, C. (2009). Concepts of and tools for nonlinear time series model-
ling. In E. Kontoghiorghes and D. Belsley (Eds.) Handbook of Computational Economet-
rics, Wiley, New York, pp. 377-427. DOI: 10.1002/9780470748916. See also the MPRA
working paper at http://mpra.ub.uni-muenchen.de/15140. [249]
Amendola, A., Niglio, M., and Vitale, C. (2006a). The moments of SETARMA models.
Statistics & Probability Letters, 76(6), 625–633. DOI: 10.1016/j.spl.2005.09.016. [111]
Amendola, A., Niglio, M., and Vitale, C. (2006b). Multi-step SETARMA predictors in the
analysis of hydrological time series. Physics and Chemistry of the Earth, 31(18), 1118–
1126. DOI: 10.1016/j.pce.2006.04.040. [395, 396]
Amendola, A., Niglio, M., and Vitale, C. (2007). The autocorrelation functions in
SETARMA models. In E. Kontoghiorghes and C. Gatu (Eds.) Optimisation, Econometric
and Financial Analysis. Springer-Verlag, New York, pp. 127–142.
DOI: 10.1007/3-540-36626-1 7. [111]
Amendola, A., Niglio, M., and Vitale, C. (2009a). Statistical properties of threshold models.
Communications in Statistics: Theory and Methods, 38(15), 2479–2497.
DOI: 10.1080/03610920802571146. [100]
Amendola, A., Niglio, M., and Vitale, C. (2009b). Threshold moving average models invert-
ibility. Available at: http://new.sis-statistica.org/wp-content/uploads/2013/
09/RS10-Threshold-Moving-Average-Models-Invertibility.pdf. [109]
Amisano, G. and Giacomini, R. (2007). Comparing density forecasts via weighted likelihood
ratio tests. Journal of Business & Economic Statistics, 25(2), 177–190.
DOI: 10.1198/073500106000000332. [427]
An, H.Z. and Chen S.G. (1997). A note on the ergodicity of non-linear autoregressive model.
Statistics & Probability Letters, 34(4), 365–372. DOI: 10.1016/s0167-7152(96)00204-0. [110]
An, H.Z. and Cheng, B. (1991). A Kolmogorov-Smirnov type statistic with application to
test for nonlinearity in time series. International Statistical Review, 59(3), 287–307.
DOI: 10.2307/1403689. [250]
An, H.Z., Zhu, L.X., and Li, R.Z. (2000). A mixed-type test for linearity in time series.
Journal of Statistical Planning and Inference, 88(2), 339–353.
DOI: 10.1016/S0378-3758(00)00087-2. [250]
Anderson, H.M. and Vahid, F. (1998). Testing multiple equation systems for common non-
linear components. Journal of Econometrics, 84(1), 1–36.
DOI: 10.1016/S0304-4076(97)00076-6. [457]
Anderson, H.M., Nam, K., and Vahid, F. (1999). Asymmetric nonlinear smooth transition
Garch models. In P. Rothman (Ed.) Non Linear Time Series Analysis of Economic and
Financial Data. Kluwer, Amsterdam, pp. 191–207.
DOI: 10.1007/978-1-4615-5129-4 10. [80]
Andrews, D.W.K. (1993). Tests for parameter instability and structural change with un-
known change point. Econometrica, 61(4), 821–856. DOI: 10.2307/2951764. [189]
Araújo Santos, P. and Fraga Alves, M.I. (2012). A new class of independence tests for
interval forecasts evaluation. Computational Statistics & Data Analysis, 56(11), 3366–
3380. DOI: 10.1016/j.csda.2010.10.002. [421]
Ashley, R.A., Patterson, D.M., and Hinich, M.J. (1986). A diagnostic test for nonlinear serial
dependence in time series fitting errors. Journal of Time Series Analysis, 7(3), 165–178.
DOI: 10.1111/j.1467-9892.1986.tb00500.x. [133, 147, 149]
Ashley, R.A. and Patterson, D.M. (1989). Linear versus nonlinear macroeconomies: A stat-
istical test. International Economic Review, 30(3), 685–704. DOI: 10.2307/2526783. [150]
Ashley R.A. and Patterson, D.M. (2002). Identification of coefficients in a quadratic mov-
ing average process using the generalized method of moments. Available at: http:
//ashleymac.econ.vt.edu/working_papers/E2003_5.pdf. [73]
Assaad, M., Boné, R., and Cardot, H. (2008). A new boosting algorithm for improved time-
series forecasting with recurrent neural networks. Information Fusion, 9(1), 41–55.
DOI: 10.1016/j.inffus.2006.10.009. [383]
Astatkie, T. (2006). Absolute and relative measures for evaluating the forecasting perform-
ance of time series models for daily streamflows. Nordic Hydrology, 37(3), 205–215.
DOI: 10.2166/nh.2006.008. [74]
Astatkie, T., Watt, W.E., and Watts, D.G. (1996). Nested threshold autoregressive (NeTAR)
models for studying sources of nonlinearity in streamflows. Nordic Hydrology, 27(5), 323–
336. [74, 75]
Astatkie, T., Watts, D.G., and Watt, W.E. (1997). Nested threshold autoregressive (NeTAR)
models. International Journal of Forecasting, 13(1), 105–116.
DOI: 10.1016/s0169-2070(96)00716-9. [49, 84]
Aue, A., Horváth, L., and Steinebach, J. (2006). Estimation in random coefficient autore-
gressive models. Journal of Time Series Analysis, 27(1), 61–76.
DOI: 10.1111/j.1467-9892.2005.00453.x. [73]
Auestad, B. and Tjøstheim, D. (1990). Identification of nonlinear time series: First order
characterization and order estimation. Biometrika, 77(4), 669–687.
DOI: 10.1093/biomet/77.4.669. [355, 382]
Avramidis, P. (2005). Two-step cross-validation selection method for partially linear models.
Statistica Sinica, 15(4), 1033–1048. [383]
Aznarte, J.L. and Benı́tez, J.M. (2010). Equivalences between neural-autoregressive time
series models and fuzzy systems. IEEE Transactions on Neural Networks, 21(9), 1434–
1444. DOI: 10.1109/tnn.2010.2060209. [75]
Aznarte, J.L., Benı́tez, J.M., and Castro, J.L. (2007). Smooth transition autoregressive
models and fuzzy rule-based systems: Functional equivalence and consequences. Fuzzy
Sets and Systems, 158(4), 2734–2745. DOI: 10.1016/j.fss.2007.03.021. [74]
Azzalini, A. and Bowman, A.W. (1990). A look at some data on the Old Faithful geyser.
Applied Statistics, 39(3), 357–365. DOI: 10.2307/2347385. [384]
Bacon, D.W. and Watts, D.G. (1971). Estimating the transition between two intersecting
straight lines. Biometrika, 58(3), 525–534. DOI: 10.1093/biomet/58.3.525. [74]
Baek, E.G. and Brock, W.A. (1992a). A nonparametric test for independence of a multivari-
ate time series. Statistica Sinica, 2(1), 137–156. [296, 515]
Baek, E.G. and Brock, W.A. (1992b). A general test for nonlinear Granger causality: Bivari-
ate model. Technical report, Department of Economics, University of Wisconsin. Available
at: http://www.ssc.wisc.edu/~wbrock/. [515]
Baek, J.S., Park, J.A., and Hwang, S.Y. (2012). Preliminary test of fit in a general class
of conditionally heteroscedastic nonlinear time series. Journal of Statistical Computation
and Simulation, 82(5), 763–781. DOI: 10.1080/00949655.2011.558087. [250]
Bagnato, L., De Capitani, L., and Punzo, A. (2014). Testing serial independence via density-
based measures of divergence. Methodology and Computing in Applied Probability, 16(3),
627–641. DOI: 10.1007/s11009-013-9320-4. [268, 269, 272, 273, 294, 297]
Bai, J. (2003). Testing parametric conditional distributions of dynamic models. The Review
of Economics and Statistics, 85(3), 531–549. DOI: 10.1162/003465303322369704. [427]
Bai, J. and Ng, S. (2005). Tests for skewness, kurtosis, and normality for time series data.
Journal of Business & Economic Statistics, 23(1), 49–60.
DOI: 10.1198/073500104000000271. [13]
Bai, Z., Wong, W.K., and Zhang, B. (2010). Multivariate linear and nonlinear causality
tests. Mathematics and Computers in Simulation, 81(1), 5–17.
DOI: 10.1016/j.matcom.2010.06.008. [516]
Balke, N.S. and Fomby, T.B. (1997). Threshold cointegration. International Economic Re-
view, 38(3), 627–645. DOI: 10.2307/2527284. [79, 80]
Banicescu, I., Carino, R.L., Harvill, J.L., and Lestrade, J.P. (2005). Simulation of vector
nonlinear time series models on clusters. In Proceedings of the 19th IEEE International
Parallel and Distributed Processing Symposium (IPDPS’05), pp. 4–8.
DOI: 10.1109/ipdps.2005.402. [522]
Banicescu, I., Carino, R.L., Harvill, J.L., and Lestrade, J.P. (2011). Investigating asymptotic
properties of vector nonlinear time series models. Journal of Computational and Applied
Mathematics, 236(3), 411–421. DOI: 10.1016/j.cam.2011.07.018. [522]
Bao, Y., Lee, T.-H., and Saltoǧlu, B. (2007). Comparing density forecast models. Journal
of Forecasting, 26(3), 203–225. DOI: 10.1002/for.1023. [426]
Baragona, R., Battaglia, F., and Cucina, D. (2004a). Fitting piecewise linear threshold
autoregressive models by means of genetic algorithms. Computational Statistics & Data
Analysis, 47(2), 277–295. DOI: 10.1016/j.csda.2003.11.003. [79, 210]
Baragona, R., Battaglia, F., and Cucina, D. (2004b). Estimating threshold subset autore-
gressive moving-average models by genetic algorithms. Metron, LXII, n. 1, 39–61. [80,
210]
Barkoulas, J.T., Baum, C.F., and Onochie, J. (1997). A nonparametric investigation of the
90-day T-bill rate. Review of Financial Economics, 6(2), 187–198.
DOI: 10.1016/s1058-3300(97)90005-7. [381]
Barnett, A.G. and Wolff, R.C. (2005). A time-domain test for some types of nonlinearity.
IEEE Transactions on Signal Processing, 53(1), 26–33.
DOI: 10.1109/tsp.2004.838942. [150]
Barnett, W.A., Gallant, A.R., Hinich, M.J., Jungeilges, J.A., Kaplan, D.T., and Jensen,
M.J. (1997). A single-blind controlled competition among tests for nonlinearity and chaos.
Journal of Econometrics, 82(1), 157–192. DOI: 10.1016/s0304-4076(97)00081-x. [151]
Barnett, W.A., Hendry, D.F., Hylleberg, S., Teräsvirta, T., Tjøstheim, D.J., and Würtz,
A. (Eds.) (2006). Nonlinear Econometric Modeling in Time Series. Cambridge University
Press, Cambridge, UK. [597]
Bartlett, M.S. (1954). A note on multiplying factors for various χ2 approximations. Journal
of the Royal Statistical Society, B 16(2), 296–298. [460]
Basrak, B., Davis, R.A., and Mikosch, T. (2002). Regular variation of GARCH processes.
Stochastic Processes and their Applications, 99(1), 95–115.
DOI: 10.1016/s0304-4149(01)00156-9. [97]
Bates, J.M. and Granger, C.W.J. (1969). The combination of forecasts. Operational Research
Quarterly, 20(4), 451–468. DOI: 10.2307/3008764. [430]
Battaglia, F. and Orfei, L. (2005). Outlier detection and estimation in nonlinear time series.
Journal of Time Series Analysis, 26(1), 107–121.
DOI: 10.1111/j.1467-9892.2005.00392.x. [249]
Bazzi, M., Blasques, F., Koopman, S.J., and Lucas, A. (2014). Time varying transition
probabilities for Markov regime switching models. TI Discussion Paper, no. 14-072/III,
Amsterdam. Available at: http://papers.tinbergen.nl/14072.pdf.
DOI: 10.2139/ssrn.2456632. [75]
Beare, B.K. and Seo, J. (2014). Time-reversible copula-based Markov models. Econometric
Theory, 30(5), 923–960. DOI: 10.1017/s0266466614000115 [325, 332]
Bec, F., Guay, A., and Guerre, E. (2008). Adaptive consistent unit root tests based on
autoregressive threshold model. Journal of Econometrics, 142(1), 94–133.
DOI: 10.1016/j.jeconom.2007.05.011. [189]
Becker, R.A., Clark, L.A., and Lambert, D. (1994). Cave plots: A graphical technique for
comparing time series. Journal of Computational and Graphical Statistics, 3(3), 277–283.
DOI: 10.2307/1390912. [24]
Bekiros, S.D. and Diks, C. (2008). The nonlinear dynamic relationship of exchange rates:
Parametric and nonparametric causality testing. Journal of Macroeconomics, 30(4), 1641–
1650. DOI: 10.1016/j.jmacro.2008.04.001. [523]
Berg, A., Paparoditis, E., and Politis, D.N. (2010). A bootstrap test for time series linearity.
Journal of Statistical Planning and Inference, 140(12), 3841–3857.
DOI: 10.1016/j.jspi.2010.04.047. [136, 139, 140, 147]
Berkowitz, J., Christoffersen, P., and Pelletier, D. (2011). Evaluating value-at-risk models
with desk-level data. Management Science, 57(12), 2213–2227.
DOI: 10.1287/mnsc.1080.0964. [430]
Berlinet, A. and Francq, C. (1997). On Bartlett’s formula for non-linear processes. Journal
of Time Series Analysis, 18(6), 535–552. DOI: 10.1111/1467-9892.00067. [15]
Berlinet, A., Gannoun, A., and Matzner–Løber, E. (2001). Asymptotic normality of conver-
gent estimates of conditional quantiles. Statistics, 35(2), 139–169.
DOI: 10.1080/02331880108802728. [382]
Bermejo, M.A., Peña, D., and Sánchez, I. (2011). Identification of TAR models using re-
cursive estimation. Journal of Forecasting, 30(1), 31–50. DOI: 10.1002/for.1188. [250]
Bhansali, R.J. and Downham, D.Y. (1977). Some properties of the order of an autoregressive
model selected by a generalization of Akaike’s EPF criterion. Biometrika, 64(3), 547–551.
DOI: 10.1093/biomet/64.3.547. [231]
Billings, S.A. (2013). Nonlinear System Identification: NARMAX Methods in the Time,
Frequency, and Spatio-Temporal Domains. Wiley, New York.
DOI: 10.1002/9781118535561. [487]
Billings, S.A., Chen, S., and Korenberg, M.J. (1989). Identification of MIMO non-linear
systems using a forward regression orthogonal estimator. International Journal of Control,
49(6), 2157–2189. DOI: 10.1080/00207178908559767. [487]
Billingsley, P. (1995). Probability and Measure (3rd edn.). Wiley, New York. (Freely available
at: http://www.math.uoc.gr/~nikosf/Probability2013/3.pdf). [98]
Bilodeau, M. and Lafaye de Micheaux, P. (2009). A dependence statistic for mutual and
serial independence of categorical variables. Journal of Statistical Planning and Inference,
139(7), 2407–2419. DOI: 10.1016/j.jspi.2008.11.006. [296]
Birkelund, Y. and Hanssen, A. (2009). Improved bispectrum based tests for Gaussianity and
linearity. Signal Processing, 89(12), 2537–2546. DOI: 10.1016/j.sigpro.2009.04.013. [150]
Blum, J.R., Kiefer, J., and Rosenblatt, M. (1961). Distribution free tests of independence
based on the sample distribution function. Annals of Mathematical Statistics, 32(2), 485–
498. DOI: 10.1214/aoms/1177705055. [284, 285]
Blumentritt, T. and Grothe, O. (2013). Ranking ranks: A ranking algorithm for bootstrap-
ping from the empirical copula. Computational Statistics, 28(2), 455–462.
DOI: 10.1007/s00180-012-0310-8. [297]
Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes (2nd edn.). Springer-
Verlag, New York. DOI: 10.1007/978-1-4684-0489-0. [338]
Boutahar, M. (2010). Behaviour of skewness, kurtosis and normality tests in long memory
data. Statistical Methods & Applications, 19(2), 193–215.
DOI: 10.1007/s10260-009-0124-1. [23]
Bowman, A.W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis:
The Kernel Approach with S-Plus Illustrations. Oxford University Press, Oxford. [385]
Bowman, K.O. and Shenton, L.R. (1975). Omnibus contours for departures from normality based on $\sqrt{b_1}$ and $b_2$. Biometrika, 62(2), 243–250. DOI: 10.1093/biomet/62.2.243. [22]
Box, G.E.P., Jenkins, G.M., and Reinsel, G.C. (2008). Time Series Analysis, Forecasting,
and Control (4th edn.). Wiley, New York. [1]
Brännäs, K. and De Gooijer, J.G. (2004). Asymmetries in conditional mean and variance:
Modelling stock returns by asMA-asQGARCH. Journal of Forecasting, 23(3), 155–171.
DOI: 10.1002/for.910. [74, 80]
Brännäs, K., De Gooijer, J.G., and Teräsvirta, T. (1998). Testing linearity against nonlinear
moving average models. Communications in Statistics: Theory and Methods, 27(8), 2025–
2035. DOI: 10.1080/03610929808832207. [74, 188, 189]
Brännäs, K., De Gooijer, J.G., Lönnbark, C., and Soultanaeva, A. (2011). Simultaneity
and asymmetry of returns and volatilities in the emerging Baltic state stock exchanges.
Studies in Nonlinear Dynamics & Econometrics, 16:1. DOI: 10.1515/1558-3708.1855. [74]
Breaker, L.C. (2006). Nonlinear aspects of sea surface temperature in Monterey Bay. Progress
in Oceanography, 69(1), 61–89. DOI: 10.1016/j.pocean.2006.02.015. [384]
Breaker, L.C. and Lewis, P.A.W. (1988). A 40–50 day oscillation in sea-surface temperature
along the Central California coast. Estuarine, Coastal and Shelf Science, 26(4), 395–408.
DOI: 10.1016/0272-7714(88)90020-0. [384]
Breidt, F.J. (1996). A threshold autoregressive stochastic volatility model. VI Latin Amer-
ican Congress of Probability and Mathematical Statistics (CLAPEM), Valparaiso, Chile.
[80]
Breidt, F.J. and Davis, R.A. (1992). Time-reversibility, identifiability and independence of
innovations for stationary time series. Journal of Time Series Analysis, 13(5), 377–390.
DOI: 10.1111/j.1467-9892.1992.tb00114.x. [333]
Breiman, L. and Friedman, J.H. (1985). Estimating optimal transformations for multiple
regression and correlation (with discussion). Journal of the American Statistical Associ-
ation, 80(391), 580–619. DOI: 10.1080/01621459.1985.10478157. [383]
Brillinger, D.R. (1975). Time Series: Data Analysis and Theory. Holt, Rinehart and Winston, New York. [142]
Brillinger, D.R. and Rosenblatt, M. (1967). Asymptotic theory of kth order spectra. In B.
Harris (Ed.) Spectral Analysis of Time Series. Wiley, New York, pp. 189–232 (see also
pp. 153–188). [149]
Brock, W.A., Hsieh, W.D., and LeBaron, B. (1991). Nonlinear Dynamics, Chaos, and In-
stability: Statistical Theory and Economic Evidence. MIT Press, Cambridge, MA. [282]
Brock, W.A., Dechert, W.D., LeBaron, B., and Scheinkman, J.A. (1996). A test for inde-
pendence based on the correlation dimension. Econometric Reviews, 15(3), 197–235.
DOI: 10.1080/07474939608800353. [279]
Brockett, P.L., Hinich, M.J., and Patterson, D.M. (1988). Bispectral-based tests for the
detection of Gaussianity and linearity in time-series. Journal of the American Statistical
Association, 83(403), 657–664. DOI: 10.2307/2289288. [150]
Brockett, R.W. (1976). Volterra series and geometric control theory. Automatica, 12(2),
167–176. DOI: 10.1016/0005-1098(76)90080-7. [72]
Brockett, R.W. (1977). Convergence of Volterra series on infinite intervals and bilinear ap-
proximations. In V. Lakshmikathan (Ed.) Nonlinear Systems and Applications. Academic
Press, New York, pp. 39–46. DOI: 10.1016/b978-0-12-434150-0.50009-6. [73]
Brockwell, P.J. (1994). On continuous time threshold ARMA processes. Journal of Statistical
Planning and Inference, 39(2), 291–304. DOI: 10.1016/0378-3758(94)90210-0. [44]
Brockwell, P.J. and Davis, R.A. (1991). Time Series: Theory and Methods (2nd edn.).
Springer-Verlag, New York. [1, 3]
Brockwell, P.J., Liu, J., and Tweedie, R.L. (1992). On the existence of stationary threshold
autoregressive moving-average processes. Journal of Time Series Analysis, 13(2), 95–107.
DOI: 10.1111/j.1467-9892.1992.tb00096.x. [100]
Brown, B.W. and Mariano, R.S. (1984). Residual-based procedures for prediction and es-
timation in a nonlinear simultaneous system. Econometrica, 52(2), 321–343.
DOI: 10.2307/1911492. [429]
Bryant, P.G. and Cordero–Braña, O.I. (2000). Model selection using the minimum descrip-
tion length principle. American Statistician, 54(4), 257–268. DOI: 10.2307/2685777. [249]
Brys, G., Hubert, M., and Struyf, A. (2004). A robustification of the Jarque-Bera test of nor-
mality. In J. Antoch (Ed.), COMPSTAT 2004 Symposium – Proceedings in Computational
Statistics. Physica-Verlag/Springer-Verlag, New York, pp. 753–760. [22]
Buchen, T. and Wohlrabe, K. (2011). Forecasting with many predictors: Is boosting a viable
alternative? Economics Letters, 113(1), 16–18. DOI: 10.1016/j.econlet.2011.05.040. [381]
Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics,
34(2), 559–583. DOI: 10.1214/009053606000000092. [371]
Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: Regression and classification.
Journal of the American Statistical Association, 98(462), 324–339.
DOI: 10.1198/016214503000125. [371]
Burg, J.P. (1967). Maximum Entropy Spectral Analysis. Proceeding of the 37th Meeting of
the Society of Exploration, Geophysicists, Oklahoma City. Reprinted in D.G. Childers
(Ed.) (1978) Modern Spectral Analysis. IEEE Press, New York. [123]
Cai, Y. (2003). Convergence theory of a numerical method for solving the Chapman–
Kolmogorov equation. SIAM Journal on Numerical Analysis, 40(6), 2337–2351.
DOI: 10.1137/s0036142901390366. [428]
Cai, Y. (2005). A forecasting procedure for nonlinear autoregressive time series models.
Journal of Forecasting, 24(5), 335–351. DOI: 10.1002/for.959. [393]
Cai, Y. and Stander, J. (2008). Quantile self-exciting threshold autoregressive time series
models. Journal of Time Series Analysis, 29(1), 186–202.
DOI: 10.1111/j.1467-9892.2007.00551.x. [79]
Cai, Z., Fan, J., and Li, R. (2000a). Efficient estimation and inferences for varying-coefficient
models. Journal of the American Statistical Association, 95(451), 888–902.
DOI: 10.1080/01621459.2000.10474280. [384]
Cai, Z., Fan, J., and Yao, Q. (2000b). Functional-coefficient regression models for nonlinear
time series. Journal of the American Statistical Association, 95(451), 941–956.
DOI: 10.1080/01621459.2000.10474284. [374, 375]
Cai, Z., Li, Q., and Park, J.Y. (2009). Functional-coefficient models for nonstationary time
series data. Journal of Econometrics, 148(2), 101–113.
DOI: 10.1016/j.jeconom.2008.10.003. [384]
Camacho, M. (2004). Vector smooth transition regression models for US GDP and the
composite index of leading indicators. Journal of Forecasting, 23(3), 173–196.
DOI: 10.1002/for.912. [487]
Campbell, S.D. (2007). A review of backtesting and backtesting procedures. Journal of Risk,
9(2), 1–18. [430]
Caner, M. and Hansen, B.E. (2001). Threshold autoregression with a unit root. Economet-
rica, 69(6), 1555–1596. DOI: 10.1111/1468-0262.00257. [189]
Casali, K.R, Casali, A.G., Montano, N., Irigoyen, M.C., Macagnan, F., Guzzetti, S., and
Porta, A. (2008). Multiple testing strategy for the detection of temporal irreversibility in
stationary time series. Physical Review, E77(6), 066204-1–066204-7.
DOI: 10.1103/physreve.77.066204. [333]
Casdagli, M. and Eubank, S. (Eds.) (1992). Nonlinear Modeling and Forecasting, Addison-
Wesley, Redwood City. [597]
Chan, K.S. (1988). On the existence of the stationary and ergodic NEAR(p) model. Journal
of Time Series Analysis, 9(4), 319–328. DOI: 10.1111/j.1467-9892.1988.tb00473.x. [74]
Chan, K.S. (1990). Testing for threshold autoregression. The Annals of Statistics, 18(4),
1886–1894. DOI: 10.1214/aos/1176347886. [170]
Chan, K.S. (1991). Percentage points of likelihood ratio tests for threshold autoregression.
Journal of the Royal Statistical Society, B 53(3), 691–696. [170, 191]
Chan, K.S. (1993). Consistency and limiting distribution of the least squares estimator of a
threshold autoregressive model. The Annals of Statistics, 21(1), 520–533.
DOI: 10.1214/aos/1176349040. [173, 247, 249]
Chan, K.S. and Tong, H. (1985). On the use of the deterministic Lyapunov function for
the ergodicity of stochastic difference equations. Advances in Applied Probability, 17(3),
666–678. DOI: 10.2307/1427125. [111]
Chan, K.S. and Tong, H. (1986). On estimating thresholds in autoregressive models. Journal
of Time Series Analysis, 7(3), 179–190.
DOI: 10.1111/j.1467-9892.1986.tb00501.x. [73, 74, 189]
Chan, K.S. and Tong, H. (1990). On likelihood ratio tests for threshold autoregression.
Journal of the Royal Statistical Society, B 52(3), 469–476. [170, 191]
Chan, K.S. and Tong, H. (2001). Chaos: A Statistical Perspective. Springer-Verlag, New
York. DOI: 10.1007/978-1-4757-3464-5. [597]
Chan, K.S. and Tong, H. (2010). A note on the invertibility of nonlinear ARMA models.
Journal of Statistical Planning and Inference, 140(12), 3707–3714.
DOI: 10.1016/j.jspi.2010.04.036. [107]
Chan, K.S. and Tsay, R.S. (1998). Limiting properties of the least squares estimator of a
continuous threshold autoregressive model. Biometrika, 85(2), 413–426.
DOI: 10.1093/biomet/85.2.413. [44, 45]
Chan, K.S., Ho, L.-H., and Tong, H. (2006). A note on time-reversibility of multivariate
linear processes. Biometrika, 93(1), 221–227. DOI: 10.1093/biomet/93.1.221. [333]
Chan, K.S., Petruccelli, J.D., Tong, H., and Woolford, S.W. (1985). A multiple-threshold
AR(1) model. Journal of Applied Probability, 22(2), 267–279. DOI: 10.2307/3213771. [100]
Chan, N.H. and Tran, L.T. (1992). Nonparametric tests for serial dependence. Journal of
Time Series Analysis, 13(1), 19–28. DOI: 10.1111/j.1467-9892.1992.tb00092.x. [271, 272]
Chan, W.S. and Cheung, S.H. (1994). On robust estimation of threshold autoregressions.
Journal of Forecasting, 13(1), 37–49. DOI: 10.1002/for.3980130106. [248]
Chan, W.S. and Tong, H. (1986). On tests for non-linearity in time series analysis. Journal
of Forecasting, 5(4), 217–228. DOI: 10.1002/for.3980050403. [129, 147]
Chan, W.S., Wong, A.C.S., and Tong, H. (2004). Some nonlinear threshold autoregressive
time series models for actuarial use. North American Actuarial Journal, 8(4), 37–61.
DOI: 10.1080/10920277.2004.10596170. [486]
Chan, W.S., Cheung, S.H., Chow, W.K., and Zhang, L-X. (2015). A robust test for threshold-
type nonlinearity in multivariate time series analysis. Journal of Forecasting, 34(6), 441–
454. DOI: 10.1002/for.2344. [487]
Chandra, S.A. and Taniguchi, M. (2001). Estimating functions for nonlinear time series models. Annals of the Institute of Statistical Mathematics, 53(1), 125–141. [246, 248]
Chang, C.T. and Blondel, V.D. (2013). An experimental study of approximation algorithms
for the joint spectral radius. Numerical Algorithms, 64(1), 181–202.
DOI: 10.1007/s11075-012-9661-z. [455]
Charemza, W.W., Lifshits, M., and Makarova, S. (2005). Conditional testing for unit-root
bilinearity in financial time series: Some theoretical and empirical results. Journal of
Economic Dynamics & Control, 29(1-2), 63–96. DOI: 10.1016/j.jedc.2003.07.001. [189]
Chatfield, C. (1993). Calculating interval forecasts. Journal of Business & Economic Stat-
istics, 11(2), 121–135. DOI: 10.2307/1391361. [425]
Chen, C.W.S., Gerlach, R., Hwang, B.B.K., and McAleer, M. (2012). Forecasting Value-at-
Risk using nonlinear regression quantiles and the intra-day range. International Journal
of Forecasting, 28(3), 557–574. DOI: 10.1016/j.ijforecast.2011.12.004. [81]
Chen, C.W.S., McCulloch, R.E., and Tsay, R.S. (1997). A unified approach to estimating
and modeling linear and nonlinear time series. Statistica Sinica, 7(2), 451–472. [249]
Chen, C.W.S., Liu, F.C., and Gerlach, R. (2011a). Bayesian subset selection for threshold
autoregressive moving-average models. Computational Statistics, 26(1), 1–30.
DOI: 10.1007/s00180-010-0198-0. [210, 249]
Chen, C.W.S., So, M.K.P., and Liu, F.C. (2011b). A review of threshold time series models
in finance. Statistics and Its Interface, 4(2), 167–181.
DOI: 10.4310/sii.2011.v4.n2.a12. [73, 111]
Chen, D.Q. and Wang, H.B. (2011). The stationarity and invertibility of a class of nonlinear
ARMA models. Science China, Mathematics, 54(3), 469–478.
DOI: 10.1007/s11425-010-4160-y. [111, 384]
Chen, G., Abraham, B., and Bennett, G.W. (1997). Parametric and non-parametric model-
ling of time series – An empirical study. Environmetrics, 8(1), 63–74.
DOI: 10.1002/(sici)1099-095x(199701)8:1%3C63::aid-env238%3E3.0.co;2-b. [381]
Chen, H., Chong, T.T.L., and Bai, J. (2012). Theory and applications of TAR model with
two threshold variables. Econometric Reviews, 31(2), 142–170.
DOI: 10.1080/07474938.2011.607100. [189]
Chen, J. and Huo, X. (2009). A Hessian regularized nonlinear time series model. Journal of
Computational and Graphical Statistics, 18(3), 694–716.
DOI: 10.1198/jcgs.2009.08040. [384]
Chen, R., Liu, J.S., and Tsay, R.S. (1995). Additivity tests for nonlinear autoregression.
Biometrika, 82(2), 369–383. DOI: 10.1093/biomet/82.2.369. [383]
Chen, R. and Tsay, R.S. (1991). On the ergodicity of TAR(1) processes, Annals of Applied
Probability, 1(4), 613–634. DOI: 10.1214/aoap/1177005841. [100]
Chen, R. and Tsay, R.S. (1993a). Nonlinear additive ARX models. Journal of the American
Statistical Association, 88(423), 955–967. DOI: 10.2307/2290787. [381]
Chen, R. and Tsay, R.S. (1993b). Functional coefficient autoregressive models. Journal of
the American Statistical Association, 88(421), 298–308. DOI: 10.2307/2290725. [374]
Chen, R., Yang, K., and Hafner, C. (2004). Nonparametric multistep-ahead prediction in
time series analysis. Journal of the Royal Statistical Society, B 66(3), 669–686.
DOI: 10.1111/j.1467-9868.2004.04664.x. [382]
Chen, X., Linton, O., and Robinson, P.M. (2001). The estimation of conditional densities. In
M.L. Puri (Ed.) Asymptotics in Statistics and Probability, Festschrift for George Roussas.
VSP International Science Publishers, The Netherlands, pp. 71–84. Also available as LSE
STICERD Paper, No. EM/2001/415 (http://sticerd.lse.ac.uk/dps/em/em415.pdf).
[349]
Chen, Y.-T. (2003). Testing serial independence against time irreversibility. Studies in Non-
linear Dynamics & Econometrics, 7(3). DOI: 10.2202/1558-3708.1114. [321]
Chen, Y.-T. and Kuan, C.-M. (2002). Time irreversibility and EGARCH effects in US stock
index returns. Journal of Applied Econometrics, 17(5), 565–578.
DOI: 10.1002/jae.692. [321]
Chen, Y.-T., Chou, R.Y., and Kuan, C.-M. (2000). Testing time reversibility without mo-
ment restrictions. Journal of Econometrics, 95(1), 199–218.
DOI: 10.1016/s0304-4076(99)00036-6. [320, 321, 333]
Cheng, B. and Tong, H. (1992). On consistent non-parametric order determination and chaos
(with discussion). Journal of the Royal Statistical Society, B 54(2), 427–474.
DOI: 10.1142/9789812836281 0010. [383]
Cheng, C., Sa-ngasoongsong, A., Beyca, O., Le, T, Yang, H., Kong, Z., and Bukkapatnam,
S.T.S. (2015). Time series forecasting for nonlinear and non-stationary processes: A review
and comparative study. IIE Transactions, 47(10), 1053–1071.
DOI: 10.1080/0740817x.2014.999180. [427]
Cheng, Q. (1992). On the unique representation of non-Gaussian linear processes. The An-
nals of Statistics, 20(2), 1143–1145. DOI: 10.1214/aos/1176348677. [333]
Cheng, Y. and De Gooijer, J.G. (2007). On the uth geometric conditional quantile. Journal
of Statistical Planning and Inference, 137(6), 1914–1930.
DOI: 10.1016/j.jspi.2006.02.014. [521]
Chini, E.Z. (2013). Generalizing smooth transition autoregressions. CREATES research pa-
per 2013-32, Aarhus University. Available at: ftp://ftp.econ.au.dk/creates/rp/13/
rp13_32.pdf. Also available at: http://economia.unipv.it/docs/dipeco/quad/ps/
RePEc/pav/demwpp/DEMWP0114.pdf. [74]
Chung, Y.P. and Zhou, Z.G. (1996). The predictability of stock returns – a nonparametric
approach. Econometric Reviews, 15(3), 299–330. DOI: 10.1080/07474939608800357. [429]
Claeskens, G., Magnus, J.R., Vasnev, A.L., and Wang, W. (2016). The forecast combination
puzzle: A simple theoretical explanation. International Journal of Forecasting, 32(3),
754–762. DOI: 10.1016/j.ijforecast.2015.12.005. [425]
Clark, T.E. (2007). An overview of recent developments in forecast evaluation. Available at:
http://www.bankofcanada.ca/wp-content/uploads/2010/09/clark.pdf. [427]
Clark, T.E. and McCracken, M.W. (2001). Tests of equal forecast accuracy and encompassing
for nested models. Journal of Econometrics, 105(1), 85–110.
DOI: 10.1016/s0304-4076(01)00071-9. [417, 427]
Clark, T.E. and McCracken, M.W. (2005). Evaluating direct multistep forecasts. Economet-
ric Reviews, 24(4), 369–404. DOI: 10.1080/07474930500405683. [427]
Clark, T.E. and West, K.D. (2006). Using out-of-sample mean squared prediction errors to
test the martingale difference hypothesis. Journal of Econometrics, 135(1-2), 155–186.
DOI: 10.1016/j.jeconom.2005.07.014. [431]
Clark, T.E. and West, K.D. (2007). Approximately normal tests for equal predictive accuracy
in nested models. Journal of Econometrics, 138(1), 291–311.
DOI: 10.1016/j.jeconom.2006.05.023. [417]
Clements, M.P. (2005). Evaluating Econometric Forecasts of Economic and Financial Vari-
ables. Palgrave MacMillan, New York. DOI: 10.1057/9780230596146. [412, 422, 430, 431]
Clements, M.P. and Hendry, D.F. (1993). On the limitations of comparing mean squared
forecast errors. Journal of Forecasting, 12(8), 617–637 (with discussion).
DOI: 10.1002/for.3980120815. [479]
Clements, M.P. and Krolzig, H.-M. (1998). A comparison of the forecast performance of
Markov-switching and threshold autoregressive models of US GNP. Econometrics Journal,
1(1), C47–C75. DOI: 10.1111/1368-423x.11004. [429]
Clements, M.P. and Smith, J. (1997). The performance of alternative forecasting methods
for SETAR models. International Journal of Forecasting, 13(4), 463–475.
DOI: 10.1016/s0169-2070(97)00017-4. [407, 429]
Clements, M.P. and Smith, J. (1999). A Monte Carlo study of the forecasting performance
of empirical SETAR models. Journal of Applied Econometrics, 14(2), 124–141.
DOI: 10.1002/(sici)1099-1255(199903/04)14:2%3C123::aid-jae493%3E3.0.co;2-k. [429]
Clements, M.P. and Smith, J. (2000). Evaluating the forecast densities of linear and non-
linear models: Application to output growth and unemployment. Journal of Forecasting,
19(4), 255–276.
DOI: 10.1002/1099-131x(200007)19:4%3C255::aid-for773%3E3.0.co;2-g. [430]
Clements, M.P. and Smith, J. (2001). Evaluating forecasts from SETAR models of exchange
rates. Journal of International Money and Finance, 20(1), 133–148.
DOI: 10.1016/s0261-5606(00)00039-5. [429]
Clements, M.P. and Smith, J. (2002). Evaluating multivariate forecast densities: A compar-
ison of two approaches. International Journal of Forecasting, 18(3), 397–407.
DOI: 10.1016/s0169-2070(01)00126-1. [480, 492]
Clements, M.P. and Taylor, N. (2003). Evaluating interval forecasts of high frequency fin-
ancial data. Journal of Applied Econometrics, 18(4), 445–456. DOI: 10.1002/jae.703. [430]
Clements, M.P., Franses, P.H., Smith, J., and Van Dijk, D. (2003). On SETAR non-linearity
and forecasting. Journal of Forecasting, 22(5), 359–375. DOI: 10.1002/for.863. [429]
Cleveland, R.B., Cleveland, W.S., McRae, J.W., and Terpenning, I. (1990). STL: A seasonal-
trend decomposition procedure based on loess. Journal of Official Statistics, 6(1), 3–73
(with discussion). [386]
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association, 74(368), 829–836.
DOI: 10.2307/2286407. [353, 385]
Cleveland, W.S. and Devlin, S.J. (1988). Locally weighted regression: An approach to re-
gression analysis by local fitting. Journal of the American Statistical Association, 83(403),
596–610. DOI: 10.1080/01621459.1988.10478639. [353]
Cline, D.B.H. (2007a). Stability of nonlinear stochastic recursions with application to non-
linear AR-GARCH models. Advances in Applied Probability, 39(2), 462–491.
DOI: 10.1239/aap/1183667619. [93, 94]
Cline, D.B.H. (2007b). Regular variation of order 1 nonlinear AR-ARCH models. Stochastic
Processes and their Applications, 117(7), 840–861. DOI: 10.1016/j.spa.2006.10.009. [92]
Cline, D.B.H. (2007c). Evaluating the Lyapounov exponent and existence of moments for
threshold AR-ARCH models. Journal of Time Series Analysis, 28(2), 241–260.
DOI: 10.1111/j.1467-9892.2006.00508.x. [91, 92, 93]
Cline, D.B.H. and Pu, H.H. (1999a). Geometric ergodicity of nonlinear time series. Statistica
Sinica, 9(4), 1103–1118. [91]
Cline, D.B.H. and Pu, H.H. (1999b). Stability of nonlinear AR(1) time series with delay.
Stochastic Processes and their Applications, 82(2), 307–333.
DOI: 10.1016/s0304-4149(99)00042-3. [91]
Cline, D.B.H. and Pu, H.H. (2001). Geometric transience of nonlinear time series. Statistica
Sinica, 11(1), 273–287. [91]
Cline, D.B.H. and Pu, H.H. (2004). Stability and the Lyapounov exponent of threshold AR-
ARCH models. The Annals of Applied Probability, 14(4), 1920–1949.
DOI: 10.1214/105051604000000431. [91]
Coakley, J., Fuertes, A.-M., and Pérez, M.-T. (2003). Numerical issues in threshold autore-
gressive modeling of time series. Journal of Economic Dynamics & Control, 27(11-12),
2219–2242. DOI: 10.1016/s0165-1889(02)00123-9. [248]
Collomb, G., Härdle, W., and Hassani, S. (1987). A note on prediction via estimation of the
conditional mode function. Journal of Statistical Planning and Inference, 15 (1986-1987),
227–236. DOI: 10.1016/0378-3758(86)90099-6. [340]
Connor, J.T., Martin, D.R., and Atlas, L.E. (1994). Recurrent neural networks and robust
time series prediction. IEEE Transactions on Neural Networks, 5(2), 240–254.
DOI: 10.1109/72.279188. [75]
Corradi, V. and Swanson, N.R. (2006a). Predictive density evaluation. In G. Elliott et al.
(Eds.) Handbook of Economic Forecasting, North-Holland, Amsterdam, pp. 197–284.
DOI: 10.1016/s1574-0706(05)01005-0. [427]
Corradi, V. and Swanson, N.R. (2006b). Bootstrap conditional distribution tests in the
presence of dynamic misspecification. Journal of Econometrics, 133(2), 779–806.
DOI: 10.1016/j.jeconom.2005.06.013. [427]
Corradi, V. and Swanson, N.R. (2012). A survey of recent advances in forecast ac-
curacy comparison testing, with an extension to stochastic dominance. In X. Chen
and N.R. Swanson (Eds.) Causality, Prediction and Specification Analysis: Recent Ad-
vances and Future Directions: Essays in Honour of Halbert L. White Jr. Springer-
Verlag, New York. Available at: http://www2.warwick.ac.uk/fac/soc/economics/staff/academic/corradi/research/corradi_swanson_whitefest_2012_02_09.pdf and http://econweb.rutgers.edu/nswanson/papers/corradi_swanson_whitefest_2012_02_09.pdf. [429]
Corradi, V., Swanson, N.R., and Olivetti, C. (2001). Predictive ability with cointegrated
variables. Journal of Econometrics, 104(2), 315–358.
DOI: 10.1016/s0304-4076(01)00086-0. [429]
Cox, D.R. (1981). Statistical analysis of time series: Some recent developments. Scand-
inavian Journal of Statistics, 8(2), 93–115 (with discussion). [315]
Cox, D.R. (1991). Long-range dependence, non-linearity and time irreversibility. Journal of
Time Series Analysis, 12(4), 329–335. DOI: 10.1111/j.1467-9892.1991.tb00087.x. [334]
Cressie, N. and Read, T.R.C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal
Statistical Society, B 46(3), 440–464. [265]
Cryer, J.D. and Chan, K.S. (2008). Time Series Analysis: With Applications in R (2nd
edn.). Springer-Verlag, New York. DOI: 10.1007/978-0-387-75959-3. [251]
Cutler, C.D. (1991). Some results on the behavior and estimation of the fractal dimensions
of distributions on attractors. Journal of Statistical Physics, 62(3/4), 651–708.
DOI: 10.1007/bf01017978. [312]
Cutler, C.D. and Kaplan, D.T. (Eds.) (1996). Nonlinear Dynamics and Time Series: Build-
ing a Bridge between the Natural and Statistical Sciences. Fields Institute Communica-
tions, American Mathematical Society, Providence, Rhode Island. [597]
D’Alessandro, P., Isidori, A., and Ruberti, A. (1974). Realizations and structure theory of
bilinear dynamical systems. SIAM Journal on Control, 12(3), 517–535.
DOI: 10.1137/0312040. [73]
Dagum, E.B., Bordignon, S., Cappuccio, N., Proietti, T., and Riani, M. (2004). Linear and
Non Linear Dynamics in Time Series. Pitagora Editrice, Bologna, Italy. [597]
Dai, Y. and Billard, L. (1998). A space-time bilinear model and its identification. Journal
of Time Series Analysis, 19(6), 657–679. DOI: 10.1111/1467-9892.00115. [485]
Dai, Y. and Billard, L. (2003). Maximum likelihood estimation in space time bilinear models.
Journal of Time Series Analysis, 24(1), 25–44. DOI: 10.1111/1467-9892.00291. [485]
Dalle Molle, J.W. and Hinich, M.J. (1995). Trispectral analysis of stationary random time
series. Journal of the Acoustical Society of America, 97(5), 2963–2978.
DOI: 10.1121/1.411860. [323]
Darolles, S., Florens, J.-P., and Gouriéroux, C. (2004). Kernel-based nonlinear canonical
analysis and time reversibility. Journal of Econometrics, 119(2), 323–353.
DOI: 10.1016/s0304-4076(03)00199-4. [333]
Davies, N. and Petruccelli, J.D. (1986). Detecting nonlinearity in time series. The Statisti-
cian, 35(2), 271–280. DOI: 10.2307/2987532. [193]
Daw, C.S., Finney, C.E.A., and Kennel, M.B. (2000). Symbolic approach for measuring
temporal irreversibility. Physical Review E, 62(2), 1912–1921.
DOI: 10.1103/physreve.62.1912. [333]
De Brabanter, J., Pelckmans, K., Suykens, J.A.K., and Vandewalle, J. (2005). Prediction
intervals for NAR model structures using a bootstrap method. Proceedings of the Inter-
national Symposium on Nonlinear Theory and its Applications (NOLTA 2005), Bruges,
Belgium, pp. 610–613. Available at: http://www.ieice.org/proceedings/. [429]
De Gooijer, J.G. (1998). On threshold moving average models. Journal of Time Series
Analysis, 19(1), 1–18. DOI: 10.1111/1467-9892.00074. [248]
De Gooijer, J.G. (2001). Cross-validation criteria for SETAR model selection. Journal of
Time Series Analysis, 22(3), 267–281. DOI: 10.1111/1467-9892.00223. [235]
De Gooijer, J.G. (2007). Power of the Neyman smooth test for evaluating multivariate
forecast densities. Journal of Applied Statistics, 34(4), 371–382.
DOI: 10.1080/02664760701231526. [487]
De Gooijer, J.G. and Brännäs, K. (1995). Invertibility of non-linear time series models.
Communications in Statistics: Theory and Methods, 24(11), 2701–2714.
DOI: 10.1080/03610929508831644. [105]
De Gooijer, J.G. and De Bruin, P.T. (1998). On forecasting SETAR processes. Statistics &
Probability Letters, 37(1), 7–14. DOI: 10.1016/s0167-7152(97)00092-8. [401, 404, 432]
De Gooijer, J.G. and Gannoun, A. (2000). Nonparametric conditional predictive regions for
time series. Computational Statistics & Data Analysis, 33(3), 259–275.
DOI: 10.1016/s0167-9473(99)00056-0. [384, 413, 415, 429]
De Gooijer, J.G., Gannoun, A., and Zerom, D. (2001). Multi-stage kernel-based conditional
quantile prediction in time series. Communications in Statistics: Theory and Methods,
30(12), 2499–2515. DOI: 10.1081/sta-100108445. [344, 345, 346]
De Gooijer, J.G., Gannoun, A., and Zerom, D. (2002). Mean squared error properties of the
kernel-based multi-stage median predictor for time series. Statistics & Probability Letters,
56(1), 51–56. DOI: 10.1016/S0167-7152(01)00169-9. [382]
De Gooijer, J.G., Gannoun, A., and Zerom, D. (2006). A multivariate quantile predictor.
Communications in Statistics: Theory and Methods, 35(1), 133–147.
DOI: 10.1080/03610920500439570. [497, 500, 521]
De Gooijer, J.G. and Kumar, K. (1992). Some recent developments in non-linear time series
modelling, testing, and forecasting. International Journal of Forecasting, 8(2), 135–156.
DOI: 10.1016/0169-2070(92)90115-P. Corrigendum: (1993, p. 145). [190, 428]
De Gooijer, J.G. and Ray, B.K. (2003). Modeling vector nonlinear time series using POLY-
MARS. Computational Statistics & Data Analysis, 42(1-2), 73–90.
DOI: 10.1016/S0167-9473(02)00123-8. [522]
De Gooijer, J.G., Ray, B.K., and Kräger, H. (1998). Forecasting exchange rates using TS-
MARS. Journal of International Money and Finance, 17(3), 513–534.
DOI: 10.1016/S0261-5606(98)00017-5. [381]
De Gooijer, J.G. and Yuan, A. (2016). Nonparametric portmanteau tests for detecting
nonlinearities in high dimensions. Communications in Statistics: Theory and Methods, 45(2),
385–399. DOI: 10.1080/03610926.2013.815209. [296]
De Gooijer, J.G. and Zerom, D. (2000). Kernel based multi-step-ahead prediction of the
U.S. short-term interest rate. Journal of Forecasting, 19(4), 335–353.
DOI: 10.1002/1099-131x(200007)19:4%3C335::aid-for777%3E3.3.co;2-v. [381]
De Gooijer, J.G. and Zerom, D. (2003). On conditional density estimation. Statistica Neer-
landica, 57(2), 159–176. DOI: 10.1111/1467-9574.00226. [348, 351]
de Lima, P.J.F. (1996). Nuisance parameter free properties of correlation integral based
statistics. Econometric Reviews, 15(3), 237–259. DOI: 10.1080/07474939608800354. [296]
de Lima, P.J.F. (1997). On the robustness of nonlinearity tests to moment condition failure.
Journal of Econometrics, 76(1-2), 251–280. DOI: 10.1016/0304-4076(95)01791-7. [190]
Denison, D.G.T., Mallick, B.K., and Smith, A.F.M. (1998). Bayesian MARS. Statistics and
Computing, 8(4), 337–346. [383]
Denker, M. and Keller, G. (1983). On U-statistics and v. Mises’ statistics for weakly depend-
ent processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 64(4),
505–522. [310, 518]
Deutsch, M., Granger, C.W.J., and Teräsvirta, T. (1994). The combination of forecasts
using changing weights. International Journal of Forecasting, 10(1), 47–57.
DOI: 10.1016/0169-2070(94)90049-3. [425]
Diebold, F.X. (2015). Comparing predictive accuracy, twenty years later: A personal per-
spective on the use and abuse of Diebold–Mariano tests (with discussion). Journal of
Business & Economic Statistics, 33(1). DOI: 10.2139/ssrn.2316240. [429]
Diebold, F.X. and Mariano, R.S. (1995). Comparing predictive accuracy. Journal of Business
& Economic Statistics, 13(3), 253–263. DOI: 10.2307/1392185. [416, 417, 424]
Diebold, F.X., Gunther, T.A., and Tay, A.S. (1998). Evaluating density forecasts with ap-
plications to financial risk management. International Economic Review, 39(4), 863–883.
DOI: 10.2307/2527342. [430]
Diebold, F.X., Hahn, J., and Tay, A.S. (1999). Multivariate density forecast evaluation and
calibration in financial risk management: High-frequency returns on foreign exchange.
Review of Economics and Statistics, 81(4), 661–673. DOI: 10.1162/003465399558526. [430]
Diebold, F.X., Tay, A.S., and Wallis, K.F. (1999). Evaluating density forecasts of inflation:
The survey of professional forecasters. In R.F. Engle and H. White (Eds.) Cointegration,
Causality and Forecasting, Festschrift in Honour of Clive W.J. Granger. Oxford University
Press, New York, pp. 76–90. DOI: 10.3386/w6228. [430]
Diks, C. (1999). Nonlinear Time Series Analysis: Methods and Applications. World Sci-
entific, Singapore. DOI: 10.1142/3823. [597]
Diks, C. (2009). Nonparametric tests for independence. In R.A. Meyers (Ed.) Encyclopedia
of Complexity and Systems Science. Springer-Verlag, New York, pp. 6252–6271.
DOI: 10.1007/978-0-387-30440-3_369. [262]
Diks, C., Van Houwelingen, J.C., Takens, F., and DeGoede, J. (1995). Reversibility as a
criterion for discriminating time series. Physics Letters, A 201(2-3), 221–228.
DOI: 10.1016/0375-9601(95)00239-y. [321, 327, 328]
Diks, C. and Mudelsee, M. (2000). Redundancies in the Earth’s climatological time series.
Physics Letters, A 275(5-6), 407–414. DOI: 10.1016/s0375-9601(00)00613-7. [24]
Diks, C. and Panchenko, V. (2005). A note on the Hiemstra–Jones test for Granger non-
causality. Studies in Nonlinear Dynamics & Econometrics, 9(2).
DOI: 10.2202/1558-3708.1234. [516]
Diks, C. and Panchenko, V. (2006). A new statistic and practical guidelines for nonpara-
metric Granger causality testing. Journal of Economic Dynamics & Control, 30(9-10),
1647–1669. DOI: 10.1016/j.jedc.2005.08.008. [516, 517, 518, 521]
Diks, C. and Panchenko, V. (2007). Nonparametric tests for serial independence based on
quadratic forms. Statistica Sinica, 17(1), 81–98. [277, 291]
Diks, C. and Wolski, M. (2016). Nonlinear Granger causality: Guidelines for multivariate
analysis. Journal of Applied Econometrics, 31(7), 1333–1351.
DOI: 10.1002/jae.2495. [519, 520]
Diop, A. and Guégan, D. (2004). Tail behavior of a threshold autoregressive stochastic
volatility model. Extremes, 7(4), 367–375. DOI: 10.1007/s10687-004-3482-y. [80]
Dobrushin, R.L., Sukhov, Yu.M., and Fritz, J. (1988). A.N. Kolmogorov – the founder of
the theory of reversible Markov processes. Russian Mathematical Surveys, 43, 157–182;
translation from Uspekhi Matematicheskikh Nauk, 43(6) (1988), 167–188 (Russian).
DOI: 10.1070/rm1988v043n06abeh001985. [332]
Donner, R.V. and Barbosa, S.M. (Eds.) (2008). Nonlinear Time Series Analysis in the
Geosciences: Applications in Climatology, Geodynamics and Solar-Terrestrial Physics.
Springer-Verlag, New York. DOI: 10.1007/978-3-540-78938-3. [2, 597]
Doornik, J.A. and Hansen, H. (2008). An omnibus test for univariate and multivariate
normality. Oxford Bulletin of Economics and Statistics, 70, 927–939.
DOI: 10.1111/j.1468-0084.2008.00537.x. [22, 254]
Douc, R., Moulines, E., and Stoffer, D.S. (2014). Nonlinear Time Series: Theory, Methods,
and Applications with R Examples. Chapman & Hall/CRC Press, London. [250, 597]
Doukhan, P. (1994). Mixing. Properties and Examples. Lecture Notes in Statistics 85.
Springer-Verlag, New York. [95]
Drunat, J., Dufrenot, G., and Mathieu, L. (1998). Testing for linearity: A frequency do-
main approach. In C. Dunis and B. Zhou (Eds.) Nonlinear Modelling of High Frequency
Financial Time Series. Wiley, New York, pp. 69–86. [151]
Dueker, M.J., Psaradakis, Z., Sola, M., and Spagnolo, F. (2011). Multivariate
contemporaneous-threshold autoregressive models. Journal of Econometrics, 160(2), 311–
325. DOI: 10.1016/j.jeconom.2010.09.011. [79, 486]
Dufour, J.-M., Lepage, Y., and Zeidan, H. (1982). Nonparametric testing for time series: A
bibliography. The Canadian Journal of Statistics, 10(1), 1–38.
DOI: 10.2307/3315073. [295]
Dumitrescu, E.L., Hurlin, C., and Madkour, J. (2013). Testing interval forecasts: A GMM-
based approach. Journal of Forecasting, 32(2), 97–110. DOI: 10.1002/for.1260. [430]
Dunis, C.L. and Zhou, B. (Eds.) (1998). Nonlinear Modelling of High Frequency Financial
Time Series. Wiley, New York. [597]
Dunn, P.K. and Smyth, G.K. (1996). Randomized quantile residuals. Journal of Computa-
tional and Graphical Statistics, 5(3), 236–244. DOI: 10.2307/1390802. [240]
Eckmann, J.-P., Kamphorst, S.O., and Ruelle, D. (1987). Recurrence plots of dynamical sys-
tems. Europhysics Letters, 4(9), 973–977. DOI: 10.1209/0295-5075/4/9/004. [19]
Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical
Science, 11(2), 89–102. DOI: 10.1214/ss/1038425655. [372]
Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211. [74]
El-Shagi, M. (2011). An evolutionary algorithm for the estimation of threshold vector error
correction models. International Economics and Economic Policy, 8(4), 341–362.
DOI: 10.1007/s10368-011-0180-5. [453]
Embrechts, P., Lindskog, F., and McNeil, A.J. (2003). Modelling dependence with copulas
and applications to risk management. In S.T. Rachev (Ed.), Handbook of Heavy Tailed
Distributions in Finance, Elsevier, Chapter 8, pp. 329–384.
DOI: 10.1016/b978-044450896-6.50010-8. [306]
Enders, W. and Granger, C.W.J. (1998). Unit-root tests and asymmetric adjustment with
an example using the term structure of interest rates. Journal of Business & Economic
Statistics, 16(3), 304–311. DOI: 10.2307/1392506. [79, 189]
Engle, R.F. (2002). New frontiers for ARCH models. Journal of Applied Econometrics, 17(5),
425–446. DOI: 10.1002/jae.683. [74]
Engle, R.F. and Kozicki, S. (1993). Testing for common features. Journal of Business &
Economic Statistics, 11(4), 369–380. DOI: 10.2307/1391623. [456]
Ephraim, Y. and Merhav, N. (2002). Hidden Markov processes. IEEE Transactions on In-
formation Theory, 48(6), 1518–1569. DOI: 10.1109/tit.2002.1003838. [75]
Epps, T.W. (1987). Testing that a stationary time series is Gaussian. The Annals of Stat-
istics, 15(4), 1683–1698. DOI: 10.1214/aos/1176350618. [151]
Ertel, J.E. and Fowlkes, E.B. (1976). Some algorithms for linear spline and piecewise multiple
linear regression. Journal of the American Statistical Association, 71(355), 640–648.
DOI: 10.1080/01621459.1976.10481540. [182]
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman
& Hall, London. DOI: 10.1007/978-1-4899-3150-4. [349, 382, 409]
Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods.
Springer-Verlag, New York. DOI: 10.1007/978-0-387-69395-8_4. [209, 374, 382, 386, 414, 597]
Fan, J., Yao, Q., and Cai, Z. (2003). Adaptive varying-coefficient linear models. Journal of
the Royal Statistical Society, B 65(1), 57–80. DOI: 10.1111/1467-9868.00372. [374]
Fan, J., Yao, Q., and Tong, H. (1996). Estimation of conditional densities and sensitivity
measures in nonlinear dynamic systems. Biometrika, 83(1), 189–206.
DOI: 10.1093/biomet/83.1.189. [383]
Fan, J. and Yim, T.H. (2004). A cross-validation method for estimating conditional densities.
Biometrika, 91(4), 819–834. DOI: 10.1093/biomet/91.4.819. [382]
Fassò, A. and Negri, I. (2002). Multi-step forecasting for nonlinear models of high frequency
ground ozone data: A Monte Carlo approach. Environmetrics, 13(4), 365–378.
DOI: 10.1002/env.544. [428]
Feigin, P.D. and Tweedie, R.L. (1985). Random coefficient autoregressive processes: A
Markov chain analysis of stationarity and finiteness of moments. Journal of Time Series
Analysis, 6(1), 1–14. DOI: 10.1111/j.1467-9892.1985.tb00394.x. [96, 97]
Feo, T.A. and Resende, M.G.C. (1995). Greedy randomized adaptive search procedures.
Journal of Global Optimization, 6(2), 109–133. DOI: 10.1007/bf01096763. [74]
Ferguson, T.S., Genest, C., and Hallin, M. (2000). Kendall’s tau for serial dependence. The
Canadian Journal of Statistics, 28(3), 587–604. DOI: 10.2307/3315967. [16]
Fermanian, J.-D. and Scaillet, O. (2003). Nonparametric estimation of copulas for time
series. Journal of Risk, 5(4), 25–54. DOI: 10.2139/ssrn.372142. [296]
Fernández, V.A., Gamero, M.D.J., and García, J.M. (2008). A test for the two-sample prob-
lem based on empirical characteristic functions. Computational Statistics & Data Analysis,
52(7), 3730–3748. DOI: 10.1016/j.csda.2007.12.013. [296]
Ferrante, M., Fonseca, G., and Vidoni, P. (2003). Geometric ergodicity, regularity of the
invariant distribution and inference for a threshold bilinear Markov process. Statistica
Sinica, 13(2), 367–384. [111]
Findley, D.F. (1993). The overfitting principles supporting AIC. Statistical Research Division
Report: RR-93/04, U.S. Bureau of the Census, Washington, DC. Abstract:
http://www.census.gov.edgekey.net/srd/www/abstract/rr93-4.html. [229]
Fiorentini, G., Sentana, E., and Calzolari, G. (2004). On the validity of the Jarque–Bera normal-
ity test in conditionally heteroskedastic dynamic regression models. Economics Letters,
83(3), 307–312. DOI: 10.1016/j.econlet.2003.10.023. [23]
Fitzgerald, W.J., Smith, R.L., Walden, A.T., and Young, P.C. (Eds.) (2000). Nonlinear and
Nonstationary Signal Processing. Cambridge University Press, Cambridge, UK. [597]
Fong, W.M. (2003). Time reversibility tests of volume-volatility dynamics for stock returns.
Economics Letters, 81(1), 39–45. DOI: 10.1016/s0165-1765(03)00146-0. [333]
Fonseca, G. (2004). On the stationarity of first-order nonlinear time series models: Some
developments. Studies in Nonlinear Dynamics & Econometrics, 8(2).
DOI: 10.2202/1558-3708.1216. [111]
Fonseca, G. (2005). On the stability of nonlinear ARMA models. Quaderno della Facoltà
di Economia, 2005/3, Università dell’Insubria, Varese. Abstract: http://econpapers.repec.org/paper/insquaeco/qf0503.htm. [111]
Forbes, C.S., Kalb, G.R.J., and Kofman, P. (1999). Bayesian arbitrage threshold analysis.
Journal of Business & Economic Statistics, 17(3), 364–372. DOI: 10.2307/1392294. [488]
Francis, B.B., Mougoué, M., and Panchenko, V. (2010). Is there a symmetric nonlinear
causal relationship between large and small firms? Journal of Empirical Finance, 17(1),
23–38. DOI: 10.1016/j.jempfin.2009.08.003. [523]
Francq, C. and Zakoïan, J.-M. (2005). The L2-structures of standard and switching regime
GARCH models. Stochastic Processes and their Applications, 115(9), 1557–1582.
DOI: 10.1016/j.spa.2005.04.005. [110]
Francq, C. and Zakoïan, J.-M. (2010). GARCH Models: Structure, Statistical Inference and
Financial Applications. Wiley, New York. DOI: 10.1002/9780470670057. [25]
Franke, J. (2012). Markov switching time series models. In T. Subba Rao et al. (Eds.)
Time Series Analysis: Methods and Applications, Handbook of Statistics, Vol. 30. North-
Holland, Amsterdam, The Netherlands, pp. 99–122.
DOI: 10.1016/b978-0-444-53858-1.00005-3. [75]
Franke, J., Härdle, W., and Martin, D. (1984). Robust and Nonlinear Time Series Analysis.
Springer-Verlag, New York. [597]
Franke, J., Kreiss, J.-P., and Mammen, E. (2002). Bootstrap of kernel smoothing in nonlinear
time series. Bernoulli, 8(1), 1–37. Available at: http://projecteuclid.org/euclid.bj/1078951087. [382]
Franses, P.H. and Van Dijk, D. (2000). Nonlinear Time Series Models in Empirical Finance.
Cambridge University Press, Cambridge, UK. DOI: 10.1017/cbo9780511754067. [597]
Friedman, J.H. (1984a). A variable span scatterplot smoother. Laboratory for Computational
Statistics, Stanford University Technical Report No. 5. Available at: http://www.slac.stanford.edu/cgi-wrap/getdoc/slac-pub-3477.pdf. [85]
Friedman, J.H. (1984b). SMART user’s guide. Technical Report LCS01, Laboratory for Com-
putational Statistics, Stanford University. Available at: https://statistics.stanford.edu/sites/default/files/LCS%2001.pdf. [386]
Friedman, J.H. (1991). Multivariate adaptive regression splines. The Annals of Statistics,
19(1), 1–141 (with discussion). DOI: 10.1214/aos/1176347963. [365]
Friedman, J.H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the Amer-
ican Statistical Association, 76(376), 817–823.
DOI: 10.1080/01621459.1981.10477729. [364, 386]
Fukuchi, J.-I. (1999). Subsampling and model selection in time series analysis. Biometrika,
86(3), 591–604. DOI: 10.1093/biomet/86.3.591. [359]
Gabr, M.M. (1998). Robust estimation of bilinear time series models. Communications in
Statistics: Theory and Methods, 27(1), 41–53. DOI: 10.1080/03610929808832649. [248]
Galeano, P. and Peña, D. (2007). Improved model selection criteria for SETAR time series
models. Journal of Statistical Planning and Inference, 137(9), 2802–2814.
DOI: 10.1016/j.jspi.2006.10.014. [235]
Galka, A. (2000). Topics in Nonlinear Time Series Analysis – With Implications for EEG
Analysis. World Scientific, Singapore. DOI: 10.1142/9789812813237. [2, 597]
Galvão, A.B.C. (2006). Structural break threshold VARs for predicting US recessions using
the spread. Journal of Applied Econometrics, 21(4), 463–487.
DOI: 10.1002/jae.840. [74, 80]
Gao, J. (2007). Nonlinear Time Series: Semiparametric and Nonparametric Methods. Chap-
man & Hall/CRC, London. DOI: 10.1201/9781420011210. [80, 597]
Gao, J. and Tong, H. (2004). Semiparametric nonlinear time series model selection. Journal
of the Royal Statistical Society, B 66(2), 321–336.
DOI: 10.1111/j.1369-7412.2004.05303.x. [383]
Gao, J., Tjøstheim, D., and Yin, J. (2013). Estimation in threshold autoregressive models
with a stationary and a unit root regime. Journal of Econometrics, 172(1), 1–13.
DOI: 10.1016/j.jeconom.2011.12.006. [80]
Gao, W. and Tian, Z. (2009). Learning Granger causality graphs for multivariate nonlinear
time series. Journal of Systems Science and Systems Engineering, 18(1), 38–52.
DOI: 10.1007/s11518-009-5099-9. [523]
Garth, L.M. and Bresler, Y. (1996). On the use of asymptotics in detection and estimation.
IEEE Transactions on Signal Processing, 44(5), 1304–1307.
DOI: 10.1109/78.502350. [133]
Gaver, D.P. and Lewis, P.A.W. (1980). First order autoregressive gamma sequences and
point processes. Advances in Applied Probability, 12(3), 727–745.
DOI: 10.2307/1426429. [433]
Gel, Y.R. and Gastwirth, J.L. (2008). A robust modification of the Jarque–Bera test of
normality. Economics Letters, 99(1), 30–32. DOI: 10.1016/j.econlet.2007.05.022. [22]
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence,
PAMI 6(6), 721–741. DOI: 10.1109/tpami.1984.4767596. [249]
Genest, C. and Rémillard, B. (2004). Tests of independence and randomness based on the
empirical copula process. Test, 13(2), 335–369. DOI: 10.1007/bf02595777. [286, 312]
Genest, C., Ghoudi, K., and Rémillard, B. (2007). Rank-based extensions of the Brock,
Dechert, and Scheinkman test. Journal of the American Statistical Association, 102(480),
1363–1376. DOI: 10.1198/016214507000001076. [282, 283]
Gerlach, R., Chen, C.W.S., and Chan, N.Y.C. (2011). Bayesian time-varying quantile fore-
casting for Value-at-Risk in financial markets. Journal of Business & Economic Statistics,
29(4), 481–492. DOI: 10.1198/jbes.2010.08203. [81]
Gharavi, R. and Anantharam, V. (2005). An upper bound for the largest Lyapunov exponent
of a Markovian product of nonnegative matrices. Theoretical Computer Science, 332(1-3),
543–557. DOI: 10.1016/j.tcs.2004.12.025. [111]
Ghoudi, K., Kulperger, R.J., and Rémillard, B. (2001). A nonparametric test of serial inde-
pendence for time series and residuals. Journal of Multivariate Analysis, 79(2), 191–218.
DOI: 10.1006/jmva.2000.1967. [286, 287, 288, 312]
Giannakis, G.B. and Tsatsanis, K. (1994). Time-domain tests for Gaussianity and time-
reversibility. IEEE Transactions on Signal Processing, 42(12), 3460–3472.
DOI: 10.1109/78.340780. [333]
Giannerini, S., Maasoumi, E., and Dagum, E.B. (2015). Entropy testing for nonlinear serial
dependence in time series. Biometrika, 102(3), 661–675.
DOI: 10.1093/biomet/asv007. [295]
Giordano, F. (2000). The variance of CLS estimators for a simple bilinear model. Quaderni
di Statistica, 2(2), 147–155. [248, 251]
Giordano, F. and Vitale, C. (2003). CLS asymptotic variance for a particular relevant bilinear
time series model. Statistical Methods & Applications, 12(2), 169–185.
DOI: 10.1007/s10260-003-0061-3. [248, 253]
Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Stat-
istical Association, 106(494), 746–762. DOI: 10.1198/jasa.2011.r10138. [430]
Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. An-
nals of Mathematical Statistics, 31(4), 1208–1211. DOI: 10.1214/aoms/1177705693. [248]
Godambe, V.P. (1985). The foundations of finite sample estimation in stochastic processes.
Biometrika, 72(2), 419–428. DOI: 10.1093/biomet/72.2.419. [248]
Goldsheid, I. Ya. (1991). Lyapunov exponents and asymptotic behaviour of the product
of random matrices. In L. Arnold et al. (Eds.) Lyapunov Exponents. Lecture Notes in
Mathematics, Vol. 1486. Springer-Verlag, New York, pp. 23–37.
DOI: 10.1007/bfb0086655. [111]
Gonzalo, J. and Pitarakis, J.-Y. (2002). Estimation and model selection based inference in
single and multiple threshold models. Journal of Econometrics, 110(2), 319–352.
DOI: 10.1016/s0304-4076(02)00098-2. [195, 249]
Gouriéroux, C. and Jasiak, J. (2005). Nonlinear innovations and impulse responses with
application to VaR sensitivity. Annales d’Économie et de Statistique, 78, 1–33.
DOI: 10.2139/ssrn.757352. [78]
Grahn, T. (1995). A conditional least squares approach to bilinear time series estimation.
Journal of Time Series Analysis, 16(5), 509–529.
DOI: 10.1111/j.1467-9892.1995.tb00251.x. [217, 218, 220]
Granger, C.W.J. (1969). Investigating causal relations by econometric models and cross-
spectral methods. Econometrica, 37(3), 424–438. DOI: 10.2307/1912791. [514]
Granger, C.W.J. (1989). Combining forecasts – twenty years later. Journal of Forecasting,
8(3), 167–173. DOI: 10.1002/for.3980080303. [425, 430]
Granger, C.W.J. (1993). Strategies for modelling nonlinear time-series relationships. Eco-
nomic Record, 69(3), 233–238. DOI: 10.1111/j.1475-4932.1993.tb02103.x. [246, 429]
Granger, C.W.J. and Andersen, A.P. (1978a). An Introduction to Bilinear Time Series
Models. Vandenhoeck & Ruprecht, Göttingen. [73, 101, 115, 441, 597]
Granger, C.W.J. and Andersen, A.P. (1978b). On the invertibility of time series models.
Stochastic Processes and their Applications, 8(1), 87–92.
DOI: 10.1016/0304-4149(78)90069-8. [101]
Granger, C.W.J. and Lin, J.-L. (1994). Using the mutual information coefficient to identify
lags in nonlinear models. Journal of Time Series Analysis, 15(4), 371–384.
DOI: 10.1111/j.1467-9892.1994.tb00200.x. [18, 19, 271]
Granger, C.W.J., Maasoumi, E., and Racine, J. (2004). A dependence metric for possibly
nonlinear processes. Journal of Time Series Analysis, 25(5), 649–669.
DOI: 10.1111/j.1467-9892.2004.01866.x. [336]
Granger, C.W.J., White, H., and Kamstra, M. (1989). Interval forecasting: An analysis
based on ARCH-quantile estimators. Journal of Econometrics, 40(1), 87–96.
DOI: 10.1016/0304-4076(89)90031-6. [426]
Gretton, A., Bousquet, O., Smola, A.J., and Schölkopf, B. (2005). Measuring statistical
dependence with Hilbert-Schmidt norms. In S. Jain et al. (Eds.) 16th International Con-
ference on Algorithmic Learning Theory. Springer-Verlag, Berlin, pp. 63–77.
DOI: 10.1007/11564089_7. [296]
Grünwald, P.D., Myung, I.J., and Pitt, M.A. (Eds.) (2005). Advances in Minimum Descrip-
tion Length: Theory and Applications. MIT Press. [249]
Guay, A. and Scaillet, O. (2003). Indirect inference, nuisance parameter and threshold mov-
ing average models. Journal of Business & Economic Statistics, 21(1), 122–132.
DOI: 10.1198/073500102288618829. [74]
Guégan, D. (1994). Séries Chronologiques Non Linéaires à Temps Discret. Economica, Paris.
[597]
Guégan, D. and Pham, T.D. (1992). Power of the score test against bilinear time series
models. Statistica Sinica, 2(1), 157–169. [194]
Guégan, D. and Wandji, J.N. (1996). Power of the Lagrange multiplier test for certain
subdiagonal bilinear models. Statistics & Probability Letters, 29(3), 201–212.
DOI: 10.1016/0167-7152(95)00174-3. [188]
Guo, M. and Petruccelli, J. (1991). On the null recurrence and transience of a first-order
SETAR model. Journal of Applied Probability, 28(3), 584–592. DOI: 10.2307/3214493. [99]
Guo, M. and Tseng, Y.K. (1997). A comparison between linear and nonlinear forecasts for
nonlinear AR models. Journal of Forecasting, 16(7), 491–508.
DOI: 10.1002/(sici)1099-131x(199712)16:7%3C491::aid-for669%3E3.0.co;2-3. [433]
Guo, M., Bai, Z., and An, H.Z. (1999). Multi-step prediction for nonlinear autoregressive
models based on empirical distributions. Statistica Sinica, 9(2), 559–570. [400, 401]
Guo, Z.-F. and Shintani, M. (2011). Nonparametric lag selection for nonlinear additive
autoregressive models. Economics Letters, 111(2), 131–134.
DOI: 10.1016/j.econlet.2011.01.014. [383]
Györfi, L., Härdle, W., Sarda, P., and Vieu, P. (1989). Nonparametric Curve Estimation
from Time Series. Springer-Verlag, New York. DOI: 10.1007/978-1-4612-3686-3. [382]
Haldrup, N., Meitz, M., and Saikkonen, P. (Eds.) (2014). Essays in Nonlinear Time Series
Econometrics. Oxford University Press, Oxford, UK.
DOI: 10.1093/acprof:oso/9780199679959.001.0001. [597]
Hall, P. (1989). On projection pursuit regression. The Annals of Statistics, 17(2), 573–588.
DOI: 10.1214/aos/1176347126. [383]
Hall, P. and Minnotte, M.C. (2002). Higher order data sharpening for density estimation.
Journal of the Royal Statistical Society, B 64(1), 141–157.
DOI: 10.1111/1467-9868.00329. [520]
Hall, P. and Morton, S.C. (1993). On the estimation of entropy. Annals of the Institute of
Statistical Mathematics, 45(1), 69–88. [525]
Hall, S.G. and Mitchell, J. (2007). Combining density forecasts. International Journal of
Forecasting, 23(1), 1–13. DOI: 10.1016/j.ijforecast.2006.08.001. [426]
Hallin, M. (1980). Invertibility and generalized invertibility of time series models. Journal
of the Royal Statistical Society, B 42(2), 210–212. [102, 111]
Hallin, M. and Puri, M.L. (1992). Rank tests for time series analysis: A survey. In D.R.
Brillinger et al. (Eds.) New Directions in Time Series Analysis, Part I. Springer-Verlag,
New York, pp. 111–153. [295]
Hamaker, E.L. (2009). Using information criteria to determine the number of regimes in
threshold autoregressive models. Journal of Mathematical Psychology, 53(6), 518–529.
DOI: 10.1016/j.jmp.2009.07.006. [249]
Hamilton, J.D. (1994). Time Series Analysis. Princeton University Press, Princeton, NJ.
[68]
Hannan, E.J. (1979). The statistical theory of linear systems. In P.R. Krishnaiah (Ed.)
Developments in Statistics, Vol. 2. Academic Press, New York, pp. 83–122. [22]
Hannan, E.J. and Deistler, M. (2012). The Statistical Theory of Linear Systems. Classics
in Applied Mathematics (CL70), SIAM, Philadelphia (Originally published: Wiley, New
York, 1988). DOI: 10.1137/1.9781611972191. [22]
Hansen, B.E. (1996). Inference when a nuisance parameter is not identified under the null
hypothesis. Econometrica, 64(2), 413–430. DOI: 10.2307/2171789. [171]
Hansen, B.E. (1997). Inference in TAR models. Studies in Nonlinear Dynamics & Econo-
metrics, 2(1). DOI: 10.2202/1558-3708.1024. [189]
Hansen, B.E. (1999). Testing for linearity. Journal of Economic Surveys, 13(5), 551–576.
DOI: 10.1111/1467-6419.00098. [172, 249]
Hansen, B.E. (2000). Sample splitting and threshold estimation. Econometrica, 68(3), 575–
603. DOI: 10.1111/1468-0262.00124. [189]
Hansen, B.E. (2005). Exact mean integrated squared error of higher order kernel estimators.
Econometric Theory, 21(6), 1031–1057. DOI: 10.1017/s0266466605050528. [305]
Hansen, B.E. (2011). Threshold autoregression in economics. Statistics and Its Interface,
4(2), 123–127. DOI: 10.4310/sii.2011.v4.n2.a4. [73]
Hansen, B.E. and Seo, B. (2002). Testing for two-regime threshold cointegration in vector
error-correction models. Journal of Econometrics, 110(2), 293–318.
DOI: 10.1016/s0304-4076(02)00097-0. [486]
Hansen, L.P. (1982). Large sample properties of generalized method of moments estimators.
Econometrica, 50(4), 1029–1054. DOI: 10.2307/1912775. [248]
Hansen, M. and Yu, B. (2001). Model selection and the principle of minimum description
length. Journal of the American Statistical Association, 96(454), 746–774.
DOI: 10.1198/016214501753168398. [249]
Härdle, W., Lütkepohl, H., and Chen, R. (1997). A review of nonparametric time series
analysis. International Statistical Review, 65(1), 49–72. DOI: 10.2307/1403432. [382]
Härdle, W. and Marron, J.S. (1985). Optimal bandwidth selection in nonparametric regres-
sion function estimation. The Annals of Statistics, 13(4), 1465–1481.
DOI: 10.1214/aos/1176349748. [305]
Härdle, W., Tsybakov, A., and Yang, L. (1998). Nonparametric vector autoregression.
Journal of Statistical Planning and Inference, 68(2), 221–245.
DOI: 10.1016/s0378-3758(97)00143-2. [499]
Härdle, W. and Vieu, P. (1992). Kernel regression smoothing of time series. Journal of Time
Series Analysis, 13(3), 209–232. DOI: 10.1111/j.1467-9892.1992.tb00103.x. [382]
Harvey, D.I., Leybourne, S.J., and Newbold, P. (1997). Testing the equality of prediction
mean squared errors. International Journal of Forecasting, 13(2), 281–291.
DOI: 10.1016/s0169-2070(96)00719-4. [418]
Harvey, D.I., Leybourne, S.J., and Newbold, P. (1998). Tests for forecast encompassing.
Journal of Business & Economic Statistics, 16(2), 254–263. DOI: 10.2307/1392581. [427]
Harvey, D.I., Leybourne, S.J., and Newbold, P. (1999). Forecast evaluation in the presence
of ARCH. Journal of Forecasting, 18(6), 435–445.
DOI: 10.1002/(sici)1099-131x(199911)18:6%3C435::aid-for762%3E3.0.co;2-b. [429]
Harvill, J.L. and Newton, H.J. (1995). Saddlepoint approximations for the difference of order
statistics. Biometrika, 82(1), 226–231. DOI: 10.2307/2337643. [133]
Harvill, J.L. and Ray, B.K. (1998). Testing for nonlinearity in a vector time series. Available
at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.7136&rep=rep1&type=pdf. [486]
Harvill, J.L. and Ray, B.K. (1999). A note on tests for nonlinearity in a vector time series.
Biometrika, 86(3), 728–734. DOI: 10.1093/biomet/86.3.728. [459, 462]
Harvill, J.L. and Ray, B.K. (2000). An investigation of lag identification tools for vector
nonlinear time series. Communications in Statistics: Theory and Methods, 29(8), 1677–
1702. DOI: 10.1080/03610920008832573. [513, 525]
Harvill, J.L. and Ray, B.K. (2005). A note on multi-step forecasting with functional coeffi-
cient autoregressive models. International Journal of Forecasting, 21(4), 717–727.
DOI: 10.1016/j.ijforecast.2005.04.012. [506, 522]
Harvill, J.L. and Ray, B.K. (2006). Functional coefficient autoregressive models for vector
time series. Computational Statistics & Data Analysis, 50(12), 3547–3566.
DOI: 10.1016/j.csda.2005.07.016. [506, 509]
Harvill, J.L., Ravishanker, N., and Ray, B.K. (2013). Bispectral-based methods for clustering
time series. Computational Statistics & Data Analysis, 64, 113–131.
DOI: 10.1016/j.csda.2013.03.001. [150]
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall, London.
[372, 383]
Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and their ap-
plications. Biometrika, 57(1), 97–109. DOI: 10.1093/biomet/57.1.97. [249]
Haykin, S. (Ed.) (1979). Nonlinear Methods of Spectral Analysis. Springer-Verlag, New York.
[597]
Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unend-
lichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136, 210–
271. [264]
Hendry, D.F. and Clements, M.P. (2004). Pooling of forecasts. Econometrics Journal, 7(1),
1–31. DOI: 10.1111/j.1368-423x.2004.00119.x. [426]
Henneke, J.S., Rachev, S.T., Fabozzi, F.J., and Nikolov, M. (2011). MCMC based estimation
of Markov switching ARMA-GARCH models. Applied Economics, 43(3), 259–297.
DOI: 10.1080/00036840802552379. [75]
Herrndorf, N. (1984). A functional central limit theorem for weakly dependent sequences of
random variables. The Annals of Probability, 12(1), 141–153.
DOI: 10.1214/aop/1176993379. [96]
Hertz, J., Krogh, A., and Palmer, R.G. (1992). Introduction to the Theory of Neural Com-
putation. Addison-Wesley, New York. [74]
Hiemstra, C. and Jones, J.D. (1994). Testing for linear and nonlinear Granger causality in
the stock price-volume relation. The Journal of Finance, 49(5), 1639–1664.
DOI: 10.2307/2329266. [515, 516]
Hili, O. (2001). Hellinger distance estimation of SSAR models. Statistics & Probability Let-
ters, 53(3), 305–314. DOI: 10.1016/s0167-7152(01)00086-4. [248]
Hili, O. (2003). Hellinger distance estimation of nonlinear dynamical systems. Statistics &
Probability Letters, 63(2), 177–184. DOI: 10.1016/s0167-7152(03)00080-4. [248]
Hili, O. (2008a). Hellinger distance estimation of general bilinear time series models. Stat-
istical Methodology, 5(2), 119–128. DOI: 10.1016/j.stamet.2007.06.005. [248]
Hinich, M.J. (1982). Testing for Gaussianity and linearity of stationary time series. Journal
of Time Series Analysis, 3(3), 169–176.
DOI: 10.1111/j.1467-9892.1982.tb00339.x. [119, 130, 131, 136]
Hinich, M.J. and Patterson, D.M. (1985). Evidence of nonlinearity in daily stock returns.
Journal of Business & Economic Statistics, 3(1), 69–77. DOI: 10.2307/1391691. [150]
Hinich, M.J. and Rothman, P. (1998). A frequency-domain test of time reversibility. Mac-
roeconomic Dynamics, 2(1), 72–88. [322, 323]
Hinich, M.J., Foster, J., and Wild, P. (2006). Structural change in macroeconomic time
series: A complex systems perspective. Journal of Macroeconomics, 28(1), 136–150.
DOI: 10.1016/j.jmacro.2005.10.009. [319]
Hinich, M.J. and Wolinsky, M.A. (1988). A test for aliasing using bispectral analysis. Journal
of the American Statistical Association, 83(402), 499–501.
DOI: 10.1080/01621459.1988.10478623. [150]
Hinich, M.J., Mendes, E.M., and Stone, L. (2005). Detecting nonlinearity in time series:
Surrogate and bootstrap approaches. Studies in Nonlinear Dynamics & Econometrics,
9(4). DOI: 10.2202/1558-3708.1268. [136, 151]
Hjellvik, V. and Tjøstheim, D. (1995). Nonparametric tests of linearity for time series.
Biometrika, 82(2), 351–368. DOI: 10.2307/2337413. [250]
Hjellvik, V. and Tjøstheim, D. (1996). Nonparametric statistics for testing linearity and
serial dependence. Journal of Nonparametric Statistics, 6(2-3), 223–251.
DOI: 10.1080/10485259608832673. [250]
Hjellvik, V., Yao, Q., and Tjøstheim, D. (1998). Linearity testing using local polynomial
approximation. Journal of Statistical Planning and Inference, 68(2), 295–321.
DOI: 10.1016/s0378-3758(97)00146-8. [250]
Holst, U., Lindgren, G., Holst, J., and Thuvesholmen, M. (1994). Recursive estimation in
switching autoregressions with a Markov regime. Journal of Time Series Analysis, 15(5),
489–506. DOI: 10.1111/j.1467-9892.1994.tb00206.x. [110, 250]
Hong, Y. (1998). Testing for pairwise serial independence via the empirical distribution
function. Journal of the Royal Statistical Society, B 60(2), 429–453.
DOI: 10.1111/1467-9868.00134. [272]
Hong, Y. (2000). Generalized spectral tests for serial dependence. Journal of the Royal
Statistical Society, B 62(3), 557–574. DOI: 10.1111/1467-9868.00250. [274, 275]
Hong, Y. and Lee, T.-H. (2003). Diagnostic checking for the adequacy of nonlinear time
series models. Econometric Theory, 19(6), 1065–1121.
DOI: 10.1017/s0266466603196089. [297]
Hong, Y. and White, H. (2005). Asymptotic distribution theory for nonparametric entropy
measures of serial dependence. Econometrica, 73(3), 837–901.
DOI: 10.1111/j.1468-0262.2005.00597.x. [270, 271, 272, 273]
Hoover, W.G. (1999). Time Reversibility, Computer Simulation, and Chaos. World Sci-
entific, Singapore. DOI: 10.1142/9789812815071. [333]
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2(5), 359–366.
DOI: 10.1016/0893-6080(89)90020-8. [75]
Horová, I., Koláček, J., and Zelinka, J. (2012). Kernel Smoothing in MATLAB: Theory and
Practice of Kernel Smoothing. World Scientific, Singapore. DOI: 10.1142/8468. [385]
Hosking, J.R.M. (1980). The multivariate portmanteau statistic. Journal of the American
Statistical Association, 75(371), 602–607. DOI: 10.1080/01621459.1980.10477520. [473]
Hou, F.Z., Ning, X.B., Zhuang, J.J., Huang, X.L., Fu, M.J., and Bian, C.H. (2011). High-
dimensional time irreversibility analysis of human interbeat intervals. Medical Engineering
& Physics, 33(3), 633–637. DOI: 10.1016/j.medengphy.2011.01.002. [333]
Hristova, D. (2005). Maximum likelihood estimation of a unit root bilinear model with an
application to prices. Studies in Nonlinear Dynamics & Econometrics, 9(1).
DOI: 10.2202/1558-3708.1199. [189]
Hsiao, C., Morimune, K., and Powell, J.L. (Eds.) (2011). Nonlinear Statistical Modeling.
Cambridge University Press, Cambridge, UK. DOI: 10.1017/cbo9781139175203. [597]
Huang, H. and Lee, T.H. (2010). To combine forecasts or to combine information? Econo-
metric Reviews, 29(5-6), 534–570. DOI: 10.1080/07474938.2010.481553. [425]
Huang, J.Z. and Yang, L. (2004). Identification of non-linear additive autoregressive models.
Journal of the Royal Statistical Society, B 66(2), 463–477.
DOI: 10.1111/j.1369-7412.2004.05500.x. [372, 383]
Huang, M., Sun, Y., and White, H. (2015). A flexible nonparametric test for conditional
independence. Econometric Theory. DOI: 10.1017/S0266466615000286. Also available at:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2277240. [294]
Hubrich, K. and Teräsvirta, T. (2013). Thresholds and smooth transitions in vector autore-
gressive models. In T.B. Fomby et al. (Eds.) VAR Models in Econometrics - New De-
velopments and Applications: Essays in Honor of Christopher A. Sims. Emerald Group
Publishing Limited: Bingley, UK, Volume 32, pp. 273–326. Also available as CREATES
Research paper 2013-18 at ftp://ftp.econ.au.dk/creates/rp/13/rp13_18.pdf. [74,
80, 486]
Hung, Y. (2012). Order selection in nonlinear time series models with application to the
study of cell memory. The Annals of Applied Statistics, 6(3), 1256–1279.
DOI: 10.1214/12-aoas546. [79]
Hurvich, C.M. and Tsai, C.-L. (1989). Regression and time series model selection in small
samples. Biometrika, 76(2), 297–307. DOI: 10.1093/biomet/76.2.297. [230]
Hwang, S.Y., Basawa, I.V., and Reeves, J. (1994). The asymptotic distributions of residual
autocorrelations and related tests of fit for a class of nonlinear time series models. Statistica
Sinica, 4(1), 107–125. [250]
Hyndman, R.J. (1995). Highest-density forecast regions for nonlinear and non-normal time
series models. Journal of Forecasting, 14(5), 431–441.
DOI: 10.1002/for.3980140503. [414, 429]
Hyndman, R.J. (1996). Computing and graphing highest density regions. The American
Statistician, 50(2), 120–126. DOI: 10.2307/2684423. [414, 429]
Hyndman, R.J. and Yao, Q. (2002). Nonparametric estimation and symmetry tests for
conditional density functions. Journal of Nonparametric Statistics, 14(3), 259–278.
DOI: 10.1080/10485250212374. [382]
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation in
single-index models. Journal of Econometrics, 58(1-2), 71–120.
DOI: 10.1016/0304-4076(93)90114-k. [378]
Jacobs, P.A. and Lewis, P.A.W. (1977). A mixed autoregressive moving average exponential
sequence and point processes (EARMA 1,1). Advances in Applied Probability, 9(1), 87–
104. DOI: 10.2307/1425818. [74]
Jaditz, T. and Sayers, C.L. (1998). Out-of-sample forecast performance as a test for nonlin-
earity in time series. Journal of Business & Economic Statistics, 16(1), 110–117.
DOI: 10.2307/1392021. [387, 429]
Jahan, N. and Harvill, J.L. (2008). Bispectral-based goodness-of-fit tests of Gaussianity and
linearity of stationary time series. Communications in Statistics: Theory and Methods,
37(20), 3216–3227. DOI: 10.1080/03610920802133319. [134, 147]
Jarque, C.M. and Bera, A.K. (1987). A test for normality of observations and regression
residuals. International Statistical Review, 55(2), 163–172. DOI: 10.2307/1403192. [10]
Joe, H. (1989). Estimation of entropy and other functionals of a multivariate density. Annals
of the Institute of Statistical Mathematics, 41(4), 683–697. DOI: 10.1007/bf00057735. [525]
Joe, H. (1997). Multivariate Models and Dependence Concepts. Chapman & Hall, London.
DOI: 10.1201/b13150. [305]
Johnson, R.A. and Wichern, D.W. (2002). Applied Multivariate Statistical Analysis, (5th
edn.). Prentice Hall, New York. [462]
Jones, D.A. (1978). Nonlinear autoregressive processes. Proceedings of the Royal Society of
London, A 360(1700), 71–95. DOI: 10.1098/rspa.1978.0058. [73, 428]
Jones, D.A. (1978). Linearization of non-linear state equation. Bulletin of the Polish
Academy of Sciences, Technical Sciences, 54(1), 63–73. [429]
Jose, K.K. and Thomas, M.M. (2012). A product autoregressive model with log-Laplace
marginal distribution. Statistica, LXXII(3), 317–336. [74]
Kallenberg, W. (2009). Estimating copula densities using model selection techniques. Insur-
ance: Mathematics and Economics, 45(2), 209–223.
DOI: 10.1016/j.insmatheco.2009.06.006. [296]
Kalliovirta, L. and Saikkonen, P. (2010). Reliable residuals for multivariate nonlinear time
series models. Available at: http://blogs.helsinki.fi/saikkone/research/. [475, 476]
Kalliovirta, L., Meitz, M., and Saikkonen, P. (2015). A Gaussian mixture autoregressive
model for univariate time series. Journal of Time Series Analysis, 36(2), 247–266.
DOI: 10.1111/jtsa.12108. [296]
Kankainen, A. and Ushakov, N.G. (1998). A consistent modification of a test for independ-
ence based on the empirical characteristic function. Journal of Mathematical Sciences,
89(5), 1582–1589. DOI: 10.1007/bf02362283. [296]
Kantz, H. and Schreiber, T. (2004). Nonlinear Time Series Analysis (2nd edn.). Cambridge
University Press, Cambridge, UK. DOI: 10.1017/cbo9780511755798. [25, 597]
Kapetanios, G. (2000). Small sample properties of the conditional least squares estimator
in SETAR models. Economics Letters, 69(3), 267–276.
DOI: 10.1016/s0165-1765(00)00314-1. [246]
Kapetanios, G. (2001). Model selection in threshold models. Journal of Time Series Analysis,
22(6), 733–754. DOI: 10.1111/1467-9892.00251. [249]
Kapetanios, G. and Shin, Y. (2006). Unit root tests in three-regime SETAR models. Eco-
nometrics Journal, 9(2), 252–278. DOI: 10.1111/j.1368-423x.2006.00184.x. [189]
Karlsen, H. and Tjøstheim, D. (1988). Consistent estimates for the NEAR(2) and NLAR(2)
time series model. Journal of the Royal Statistical Society, B 50(2), 313–320. [74]
Karvanen, J. (2005). A resampling test for the total independence of stationary time series:
Application to the performance evaluation of ICA algorithms. Neural Processing Letters,
22(3), 311–324. DOI: 10.1007/s11063-005-0956-0. [296]
Keenan, D.M. (1985). A Tukey nonadditivity-type test for time series nonlinearity. Biomet-
rika, 72(1), 39–44. DOI: 10.1093/biomet/72.1.39. [179, 180]
Kemperman, J.H.D. (1987). The median of a finite measure on a Banach space. In Y. Dodge
(Ed.) Statistical Data Analysis Based on the L1-Norm and Related Methods. North-Holland,
Amsterdam, pp. 217–230. [496]
Khan, S., Bandyopadhyay, S., Ganguly, A.R., Saigal, S., Erickson III, D.J., Protopopescu,
V., and Ostrouchov, G. (2007). Relative performance of mutual information estimation
methods for quantifying the dependence among short and noisy data. Physical Review, E
76(2), 026209. DOI: 10.1103/physreve.76.026209. [23]
Kilian, L. (1998). Small sample confidence intervals for impulse response functions. The
Review of Economics and Statistics, 80(2), 218–230.
DOI: 10.1162/003465398557465. [411, 412]
Kiliç, R. (2016). Tests for linearity in STAR models: SupWald and LM-type tests. Journal
of Time Series Analysis, 37(5), 660–674. DOI: 10.1111/jtsa.12180. [188]
Kim, C.J. and Nelson, C.R. (1999). State-Space Models with Regime Switching, Classical
and Gibbs-Sampling Approaches with Applications. The MIT Press, Cambridge, MA. [75]
Kim, J.H. (2003). Forecasting autoregressive time series with bias-corrected parameter es-
timators. International Journal of Forecasting, 19(3), 493–502.
DOI: 10.1016/s0169-2070(02)00062-6. [411]
Kim, T.S., Yoon, J.H., and Lee, H.K. (2002). Performance of a nonparametric multivariate
nearest neighbor model in the prediction of stock index returns. Asia Pacific Management
Review, 7(1), 107–118. [382]
Kim, W.K. and Billard, L. (1990). Asymptotic properties for the first-order bilinear time
series model. Communications in Statistics: Theory and Methods, 19(4), 1171–1183.
DOI: 10.1080/03610929008830255. [248]
Kim, W.K., Billard, L., and Basawa, I.V. (1990). Estimation for the first-order diagonal
bilinear time series model. Journal of Time Series Analysis, 11(3), 215–229.
DOI: 10.1111/j.1467-9892.1990.tb00053.x. [252, 253]
Kim, Y. and Lee, S. (2002). On the Kolmogorov–Smirnov type test for testing nonlinearity
in time series. Communications in Statistics: Theory and Methods, 31(2), 299–309.
DOI: 10.1081/sta-120002653. [250]
Klement, E.P. and Mesiar, R. (2006). How non-symmetric can a copula be? Commentationes
Mathematicae Universitatis Carolinae, 47, 141–148. [325]
Knotters, M. and De Gooijer, J.G. (1999). TARSO modeling of water table depths. Water
Resources Research, 35(3), 695–705. DOI: 10.1029/1998WR900049. [80, 242, 245, 246, 251]
Ko, S.I.M. and Park, S.Y. (2013). Multivariate density forecast evaluation: A modified
approach. International Journal of Forecasting, 29(3), 431–441.
DOI: 10.1016/j.ijforecast.2012.11.006. [480, 492]
Kočenda, E. (2001). An alternative to the BDS test: Integration across the correlation
integral. Econometric Reviews, 20(3), 337–351. DOI: 10.1081/etc-100104938. [312]
Kočenda, E. and Briatka, Ľ. (2005). Optimal range for the iid test based on integration
across the correlation integral. Econometric Reviews, 24(3), 265–296.
DOI: 10.1080/07474930500243001. [280]
Kock, A.B. and Teräsvirta, T. (2011). Forecasting with nonlinear time series models. In
M.P. Clements and D.F. Hendry (Eds.) The Oxford Handbook of Economic Forecasting,
Oxford University Press, Oxford, pp. 61–88.
DOI: 10.1093/oxfordhb/9780195398649.013.0004. [427]
Koizumi, K., Okamoto, N., and Seo, T. (2009). On Jarque–Bera tests for assessing multivari-
ate normality. Journal of Statistics: Advances in Theory and Applications, 1(2), 207–220.
Available at: http://www.scientificadvances.co.in/about-this-journal/4. [23]
Kojadinovic, I. and Yan, J. (2011). Tests of serial independence for continuous mul-
tivariate time series based on a Möbius decomposition of the independence empir-
ical copula process. Annals of the Institute of Statistical Mathematics, 63(2), 347–373.
DOI: 10.1007/s10463-009-0257-x. Available at: http://www.ism.ac.jp/editsec/aism/pdf/10463_2009_Article_257.pdf. [287, 289]
Kolmogorov, A.N. (1936). Zur Theorie der Markoffschen Ketten. Mathematische Annalen, 112,
155–160. [332]
Koop, G. and Potter, S.M. (1999). Dynamic asymmetries in U.S. unemployment. Journal
of Business & Economic Statistics, 17(3), 298–312. DOI: 10.2307/1392288. [208]
Koop, G. and Potter, S.M. (2001). Are apparent findings of nonlinearity due to structural
instability in economic time series? Econometrics Journal, 4(1), 37–55.
DOI: 10.1111/1368-423x.00055. [249]
Koop, G. and Potter, S.M. (2003). Bayesian analysis of endogenous delay threshold models.
Journal of Business & Economic Statistics, 21(1), 93–103.
DOI: 10.1198/073500102288618801. [79]
Koop, G., Pesaran, M.H., and Potter, S.M. (1996). Impulse response analysis in nonlinear
multivariate models. Journal of Econometrics, 74(1), 119–147.
DOI: 10.1016/0304-4076(95)01753-4. [77, 79, 489]
Kooperberg, C., Bose, S., and Stone, C.J. (1997). Polychotomous regression. Journal of the
American Statistical Association, 92(437), 117–127. DOI: 10.2307/2291455. [502]
Koul, H.L. and Schick, A. (1997). Efficient estimation in nonlinear autoregressive time series
models. Bernoulli, 3(3), 247–277. DOI: 10.2307/3318592. [248]
Kreiss, J.-P. and Lahiri, S.N. (2011). Bootstrap methods for time series. In T. Subba Rao
et al. (Eds.) Handbook of Statistics, Vol. 30. North-Holland, Amsterdam, pp. 3–26.
DOI: 10.1016/b978-0-444-53858-1.00001-6. [151]
Krishnamurthy, V. and Yin, G.G. (2002). Recursive algorithms for estimation of hidden
Markov models and autoregressive models with Markov regime. IEEE Transactions on
Information Theory, 48(2), 458–476. DOI: 10.1109/18.979322. [250]
Kristensen, D. (2009). On stationarity and ergodicity of the bilinear model with applications
to GARCH models. Journal of Time Series Analysis, 30(1), 125–144.
DOI: 10.1111/j.1467-9892.2008.00603.x. [110, 116]
Kumar, K. (1986). On the identification of some bilinear time series models. Journal of Time
Series Analysis, 7(2), 117–122. DOI: 10.1111/j.1467-9892.1986.tb00489.x. [125]
Kumar, K. (1988). Bivariate bilinear models and their specification. In R.R. Mohler (Ed.)
Nonlinear Time Series and Signal Processing, Lecture Notes in Control and Information
Sciences, 106. Springer-Verlag, Berlin, pp. 59–74. [486]
Kunitomo, N. and Sato, S. (2002). Estimation of asymmetrical volatility for asset prices:
The simultaneous switching ARIMA approach. Journal of the Japan Statistical Society,
32(2), 119–140. DOI: 10.14490/jjss.32.119. [80]
Lai, T.L. and Wei, C.Z. (1982). Least squares estimates in stochastic regression models with
applications to identification and control of dynamic systems. The Annals of Statistics,
10(1), 154–166. DOI: 10.1214/aos/1176345697. [447]
Lai, T.L. and Wong, S.P.-S. (2001). Stochastic neural networks with applications to nonlinear
time series. Journal of the American Statistical Association, 96(455), 968–981.
DOI: 10.1198/016214501753208636. [75]
Lai, T.L. and Zhu, G. (1991). Adaptive prediction in non-linear autoregressive models and
control systems. Statistica Sinica, 1(2), 309–334. [429]
Lall, U. and Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time
series. Water Resources Research, 32(3), 679–693. DOI: 10.1029/95wr02966. [381, 382]
Lall, U., Sangoyomi, T., and Abarbanel, H. (1996). Nonlinear dynamics of the Great Salt
Lake: Nonparametric short-term forecasting. Water Resources Research, 32(4), 975–985.
DOI: 10.1029/95wr03402. [387]
Lanne, M. and Saikkonen, P. (2003). Modeling the U.S. short-term interest rate by mixture
autoregressive processes. Journal of Financial Econometrics, 1(1), 96–125.
DOI: 10.1093/jjfinec/nbg004. [296]
Lanterman, A.D. (2001). Schwarz, Wallace, and Rissanen: Intertwinning themes in theories
of model selection. International Statistical Review, 69(2), 185–212.
DOI: 10.2307/1403813. [249]
Lapedes, A. and Farber, R. (1987). Nonlinear Signal Processing Using Neural Networks:
Prediction and System Modelling. Technical Report LA-UR-87-2662. Los Alamos Na-
tional Laboratory, Los Alamos, New Mexico. Available at: http://permalink.lanl.
gov/object/tr?what=info:lanl-repo/lareport/LA-UR-87-2662. [75]
Lawrance, A.J. (1991). Directionality and reversibility in time series. International Statistical
Review, 59(1), 67–79. DOI: 10.2307/1403575. [333]
Lawrance, A.J. and Lewis, P.A.W. (1977). An exponential moving average sequence and
point process (EMA1). Journal of Applied Probability, 14(1), 98–113.
DOI: 10.2307/3213263. [74]
Lawrance, A.J. and Lewis, P.A.W. (1980). The exponential autoregressive-moving average
EARMA(p, q) process. Journal of the Royal Statistical Society, B 42(2), 150–161. [54]
Lawrance, A.J. and Lewis, P.A.W. (1981). A new autoregressive time series model in expo-
nential variables (NEAR(1)). Advances in Applied Probability, 13(4), 826–845.
DOI: 10.2307/1426975. [54]
Lawrance, A.J. and Lewis, P.A.W. (1985). Modelling and residual analysis of nonlinear
autoregressive time series in exponential variables. Journal of the Royal Statistical Society,
B 47(2), 165–202 (with discussion). [74]
Le, N.D., Martin, R.D., and Raftery, A.E. (1996). Modeling flat stretches, bursts, and out-
liers in time series using mixture transition distribution models. Journal of the American
Statistical Association, 91(436), 1504–1515. DOI: 10.2307/2291576. [296]
Lee, A.J. (1990). U-statistics: Theory and Practice. Marcel Dekker, New York. [308]
Lee, O. and Shin, D.W. (2000). On geometric ergodicity of the MTAR process. Statistics &
Probability Letters, 48(3), 229–237. DOI: 10.1016/s0167-7152(99)00208-4. [100]
Lee, O. and Shin, D.W. (2001). A note on stationarity of the MTAR process on the boundary
of the stationarity region. Economics Letters, 73(3), 263–268.
DOI: 10.1016/s0165-1765(01)00508-0. [99, 100]
Lee, T.-H., White, H., and Granger, C.W.J. (1993). Testing for neglected nonlinearity in time
series models: A comparison of neural network methods and alternative tests. Journal of
Econometrics, 56(3), 269–290. DOI: 10.1016/0304-4076(93)90122-l. [22, 188, 190]
Leistritz, L., Hesse, W., Arnold, M., and Witte, H. (2006). Development of interaction meas-
ures based on adaptive non-linear time series analysis of biomedical signals. Biomedical
Engineering, 51(2), 64–69. DOI: 10.1515/bmt.2006.012. [451]
Lentz, J.-R. and Mélard, G. (1981). Statistical analysis of a non-linear model. In O.D.
Anderson and M.R. Perryman (Eds.) Time Series Analysis. North-Holland, Amsterdam,
pp. 287–293. [73]
León, C.A. and Massé, J.-C. (1992). A counterexample on the existence of the L1-median.
Statistics & Probability Letters, 13(2), 117–120. DOI: 10.1016/0167-7152(92)90085-j. [496]
Lewis, P.A.W., McKenzie, E., and Hugus, D.K. (1989). Gamma processes. Communications
in Statistics: Stochastic Models, 5(1), 1–30. DOI: 10.1080/15326348908807096. [318, 335]
Lewis, P.A.W. and Ray, B.K. (1993). Nonlinear modeling of multivariate and categorical
time series using multivariate adaptive regression splines. In H. Tong (Ed.) Dimension
Estimation and Models. World Scientific, Singapore, pp. 136–169. [362, 384]
Lewis, P.A.W. and Ray, B.K. (1997). Modeling long-range dependence, nonlinearity, and
periodic phenomena in sea surface temperatures using TSMARS. Journal of the American
Statistical Association, 92(439), 881–893. DOI: 10.2307/2965552. [362, 368, 381, 384]
Lewis, P.A.W. and Ray, B.K. (2002). Nonlinear modeling of periodic threshold autoregres-
sions using Tsmars. Journal of Time Series Analysis, 23(4), 459–471.
DOI: 10.1111/1467-9892.00269. [383]
Lewis, P.A.W. and Stevens, J.G. (1991). Nonlinear modeling of time series using multivari-
ate adaptive regression splines (MARS). Journal of the American Statistical Association,
86(416), 864–877. DOI: 10.1080/01621459.1991.10475126. [381]
Li, C.W. and Li, W.K. (1996). On a double-threshold autoregressive heteroscedastic time
series model. Journal of Applied Econometrics, 11(3), 253–274.
DOI: 10.1002/(sici)1099-1255(199605)11:3%3C253::aid-jae393%3E3.0.co;2-8. [80, 225]
Li, D. (2012). A note on moving-average models with feedback. Journal of Time Series
Analysis, 33(6), 873–879. DOI: 10.1111/j.1467-9892.2012.00802.x. [106, 111]
Li, D. and He, C. (2012a). Testing common nonlinear features in nonlinear vector autore-
gressive models. Available at: http://ideas.repec.org/p/hhs/oruesi/2012_007.
html. [456, 486]
Li, D. and He, C. (2012b). Testing for linear cointegration against smooth-transition coin-
tegration. Available at: http://ideas.repec.org/p/hhs/oruesi/2012_006.html. [487]
Li, D. and He, C. (2013). Forecasting with vector nonlinear time series models. Working
papers 2013:8, Dalarna University, Sweden. Available at: http://www.diva-portal.
org/smash/get/diva2:606647/FULLTEXT02.pdf. [487, 492]
Li, D., Li, W.K., and Ling, S. (2011). On the least squares estimation of threshold autore-
gressive and moving-average models. Statistics and Its Interface, 4(2), 183–196.
DOI: 10.4310/sii.2011.v4.n2.a13. [204, 205, 206, 207, 208]
Li, D. and Ling, S. (2012). On the least squares estimation of multiple-regime threshold AR
models. Journal of Econometrics, 167(1), 240–253.
DOI: 10.1016/j.jeconom.2011.11.006. [208]
Li, D., Ling, S., and Li, W.K. (2013). Asymptotic theory on the least squares estimation of
threshold moving-average models. Econometric Theory, 29(3), 482–516.
DOI: 10.1017/S026646661200045X. [248]
Li, D., Ling, S., and Tong, H. (2012). On moving-average models with feedback. Bernoulli,
18(2), 735–745. DOI: 10.3150/11-bej352. [106, 111]
Li, D., Ling, S., and Zhang, R. (2016). On a threshold double autoregressive model. Journal
of Business & Economic Statistics, 34(1), 68–80.
DOI: 10.1080/07350015.2014.1001028. [81]
Li, G. and Li, W.K. (2008). Testing for threshold moving average with conditional heteros-
cedasticity. Statistica Sinica, 18(2), 647–665. [174, 189]
Li, G. and Li, W.K. (2011). Testing a linear time series model against its threshold extension.
Biometrika, 98(1), 243–250. DOI: 10.1093/biomet/asq074. [174, 175, 176, 190]
Li, J. (2011). Bootstrap prediction intervals for SETAR models. International Journal of
Forecasting, 27(2), 320–332. DOI: 10.1016/j.ijforecast.2010.01.013. [410, 411, 412]
Li, M.S. and Chan, K.S. (2007). Multivariate reduced-rank nonlinear time series modeling.
Statistica Sinica, 17(1), 139–159. [80]
Li, Q. and Racine, J.S. (2007). Nonparametric Econometrics: Theory and Practice. Prin-
ceton University Press, Princeton and Oxford. [298, 485, 597]
Li, W.K. (1992). On the asymptotic standard errors of residual autocorrelations in nonlinear
time series modelling. Biometrika, 79(2), 435–437.
DOI: 10.1093/biomet/79.2.435. [236, 250]
Li, W.K. (1993). A simple one degree of freedom test for time series nonlinearity. Statistica
Sinica, 3(1), 245–254. [186]
Li, W.K. (2004). Diagnostic Checks in Time Series. Chapman & Hall/CRC, New York.
(Freely available at: http://dlia.ir/Scientific/e_book/Science/General/006256.
pdf). DOI: 10.1201/9780203485606. [250]
Li, W.K. and Mak, T.K. (1994). On the squared residual autocorrelations in non-linear
time series with conditional heteroskedasticity. Journal of Time Series Analysis, 15(6),
627–636. DOI: 10.1111/j.1467-9892.1994.tb00217.x. [236]
Liang, R., Niu, C., Xia, Q., and Zhang, Z. (2015). Nonlinearity testing and modeling for
threshold moving average models. Journal of Applied Statistics, 42(12), 2614–2630.
DOI: 10.1080/02664763.2015.1043872. [189]
Liebscher, E. (2005). Towards a unified approach for proving geometric ergodicity and mixing
properties of nonlinear autoregressive processes. Journal of Time Series Analysis, 26(5),
669–689. DOI: 10.1111/j.1467-9892.2005.00412.x. [111, 114]
Lientz, B.P. (1970). Results on nonparametric modal intervals. SIAM Journal on Applied
Mathematics, 19(2), 356–366. DOI: 10.1137/0119034. [429]
Lientz, B.P. (1972). Properties of modal intervals. SIAM Journal on Applied Mathematics,
23(1), 1–5. DOI: 10.1137/0123001. [429]
Lii, K.-S. (1996). Nonlinear systems and higher-order statistics with applications. Signal
Processing, 53(2-3), 165–177. DOI: 10.1016/0165-1684(96)00084-9. [150]
Lii, K.-S. and Masry, E. (1995). On the selection of random sampling schemes for the spectral
estimation of continuous time processes. Journal of Time Series Analysis, 16(3), 291–311.
DOI: 10.1111/j.1467-9892.1995.tb00235.x. [150]
Lim, K.S. (1987). A comparative study of various univariate time series models for Canadian
lynx data. Journal of Time Series Analysis, 8(2), 161–176.
DOI: 10.1111/j.1467-9892.1987.tb00430.x. [293]
Lim, K.S. (1992). On the stability of a threshold AR(1) without intercepts. Journal of Time
Series Analysis, 13(2), 119–132. DOI: 10.1111/j.1467-9892.1992.tb00098.x. [100]
Lin, C.C. and Mudholkar, G.S. (1980). A simple test for normality against asymmetric
alternatives. Biometrika, 67(2), 455–461. DOI: 10.2307/2335489. [11]
Lin, T.C. and Pourahmadi, M. (1998). Nonparametric and non-linear models and data
mining in time series: A case-study on the Canadian lynx data. Applied Statistics, 47(2),
187–201. DOI: 10.1111/1467-9876.00106. [381]
Lindsay, B.G., Markatou, M., Ray, S., Yang, K., and Chen, S.-C. (2008). Quadratic distances
on probabilities: A unified foundation. The Annals of Statistics, 36(2), 983–1006.
DOI: 10.1214/009053607000000956. [261]
Ling, S. and Li, W.K. (1997). Diagnostic checking of nonlinear multivariate time series with
multivariate ARCH errors. Journal of Time Series Analysis, 18(5), 447–464.
DOI: 10.1111/1467-9892.00061. [487]
Ling, S. and Tong, H. (2005). Testing for a linear MA model against threshold MA models.
The Annals of Statistics, 33(6), 2529–2552.
DOI: 10.1214/009053605000000598. [102, 174, 189]
Ling, S. and Tong, H. (2011). Score based goodness-of-fit tests for time series. Statistica
Sinica, 21(4), 1807–1829. DOI: 10.5705/ss.2009.090. [250]
Ling, S., Tong, H., and Li, D. (2007). Ergodicity and invertibility of threshold moving-
average models. Bernoulli, 13(1), 161–168. DOI: 10.3150/07-bej5147. [102, 109]
Ling, S., Peng, L., and Zhu, F. (2015). Inference for a special bilinear time-series model.
Journal of Time Series Analysis, 36(1), 61–66. DOI: 10.1111/jtsa.12092. [248]
Liu, J. (1989a). A simple condition for the existence of some stationary bilinear time series.
Journal of Time Series Analysis, 10(1), 33–39.
DOI: 10.1111/j.1467-9892.1989.tb00013.x. [111]
Liu, J. (1989b). On the existence of a general-multiple bilinear time series. Journal of Time
Series Analysis, 10(4), 341–355. DOI: 10.1111/j.1467-9892.1989.tb00033.x. [443]
Liu, J. (1990). A note on causality and invertibility of a general bilinear time series model.
Advances in Applied Probability, 22(1), 247–250. DOI: 10.2307/1427608. [103]
Liu, J. (1995). On stationarity and asymptotic inference of bilinear time series models.
Statistica Sinica, 2(2), 479–494. [111]
Liu, J. and Brockwell, P.J. (1988). On the general bilinear time series model. Journal of
Applied Probability, 25(3), 553–564. DOI: 10.2307/3213984. [111]
Liu, J. and Susko, E. (1992). On strict stationarity and ergodicity of a non-linear ARMA
model. Journal of Applied Probability, 29(2), 363–373. DOI: 10.2307/3214573. [99, 100]
Liu, S.-I. (1985). Theory of bilinear time series models. Communications in Statistics: The-
ory and Methods, 14(10), 2549–2561. DOI: 10.1080/03610926.1985.10524941. [103]
Liu, S.-I. (2011). Testing for multivariate threshold autoregression. Studies in Mathematical
Sciences, 2(1), 1–20. DOI: 10.2139/ssrn.1360533. [465, 467]
Liu, W., Ling, S., and Shao, Q.-M. (2011). On non-stationary threshold autoregressive
models. Bernoulli, 17(3), 969–986. DOI: 10.3150/10-bej306. [247]
Lobato, I.N. and Velasco, C. (2004). A simple test for normality of time series. Econometric
Theory, 20(4), 671–689. DOI: 10.1017/s0266466604204030. [13]
Lomnicki, Z.A. (1961). Tests for departure from normality in the case of linear stochastic
processes. Metrika, 4(1), 37–62. DOI: 10.1007/bf02613866. [12, 242]
Lopes, H.F. and Salazar, E. (2006). Bayesian model uncertainty in smooth transition autore-
gressions. Journal of Time Series Analysis, 27(1), 99–117.
DOI: 10.1111/j.1467-9892.2005.00455.x. [74]
Lutz, R.W., Kalisch, M., and Bühlmann, P. (2008). Robustified L2 boosting. Computational
Statistics & Data Analysis, 52(7), 3331–3341. DOI: 10.1016/j.csda.2007.11.006. [383]
Luukkonen, R., Saikkonen, P., and Teräsvirta, T. (1988a). Testing linearity against smooth
transition autoregressive models. Biometrika, 75(3), 491–499.
DOI: 10.2307/2336599. [159, 165, 181, 188, 193]
Luukkonen, R., Saikkonen, P., and Teräsvirta, T. (1988b). Testing linearity in univariate
time series. Scandinavian Journal of Statistics, 15(3), 161–175. [180, 188, 193]
Ma, J. and Wohar, M. (Eds.) (2014). Recent Advances in Estimating Nonlinear Models with
Applications in Economics and Finance. Springer-Verlag, New York.
DOI: 10.1007/978-1-4614-8060-0. [430, 597]
MacNeill, I.B. (1971). Limit processes for co-spectral and quadrature spectral distribution
functions. The Annals of Mathematical Statistics, 42(1), 81–96. DOI: 10.1214/aoms/1177693497. [184]
Mak, T.K. (1993). Solving non-linear estimation equations. Journal of the Royal Statistical
Society, B 55(4), 945–955. [223]
Mak, T.K., Wong, H., and Li, W.K. (1997). Estimation of nonlinear time series with con-
ditional heteroscedastic variances by iteratively weighted least squares. Computational
Statistics & Data Analysis, 24(2), 169–178. DOI: 10.1016/s0167-9473(96)00060-6. [223]
Marek, T. (2005). On the invertibility of a random coefficient moving average model. Ky-
bernetika, 41(6), 743–756. [102, 103]
Mariano, R.S. and Preve, D. (2012). Statistical tests for multiple forecast comparison.
Journal of Econometrics, 169(1), 123–130. DOI: 10.1016/j.jeconom.2012.01.014. [429]
Marinazzo, D., Pellicoro, M., and Stramaglia, S. (2008). Kernel method for nonlinear
Granger causality. Physical Review Letters, 100(14), Article 144103.
DOI: 10.1103/physrevlett.100.144103. [523]
Marron, J.S. (1994). Visual understanding of higher order kernels. Journal of Computational
and Graphical Statistics, 3(4), 447–458. DOI: 10.2307/1390905. [305]
Masry, E. (1996a). Multivariate local polynomial regression for time series: Uniform strong
consistency and rates. Journal of Time Series Analysis, 17(6), 571–599.
DOI: 10.1111/j.1467-9892.1996.tb00294.x. [382]
Masry, E. (1996b). Multivariate regression estimation: Local polynomial fitting for time
series. Stochastic Processes and their Applications, 65(1), 81–101.
DOI: 10.1016/s0304-4149(96)00095-6. [382]
Matsuda, Y. and Huzii, M. (1997). Some statistical properties of linear and nonlinear pre-
dictors for stationary time series. Research Report on Mathematical and Computing Sci-
ences, B-325, Tokyo Institute of Technology. Abstract: http://www.is.titech.ac.jp/
~natsuko/B/B-325.txt. [145, 437]
Matusita, K. (1955). Decision rules, based on the distance, for problems of fit, two samples,
and estimation. Annals of Mathematical Statistics, 26(4), 631–641.
DOI: 10.1214/aoms/1177728422. [336]
Matzner–Løber, E., Gannoun, A., and De Gooijer, J.G. (1998). Nonparametric forecasting:
A comparison of three kernel-based methods. Communications in Statistics: Theory and
Methods, 27(7), 1593–1617. DOI: 10.1080/03610929808832180. [341, 382]
McAleer, M. and Medeiros, M.C. (2008). A multiple regime smooth transition heterogeneous
autoregressive model for long memory and asymmetries. Journal of Econometrics, 147(1),
104–119. DOI: 10.1016/j.jeconom.2008.09.032. [76]
McCarthy, M. (2005). The lynx and the snowshoe hare: Which factors cause the cyclical os-
cillations in the population? Available as a PPT download at: http://www.slideserve.
com/angeni. [292]
McCausland, W.J. (2007). Time reversibility of stationary regular finite-state Markov chains.
Journal of Econometrics, 136(3), 303–318. DOI: 10.1016/j.jeconom.2005.09.001. [332]
McKeague, I.W. and Zhang, M.-J. (1994). Identification of nonlinear time series from first
order cumulative characteristics. The Annals of Statistics, 22(1), 495–514.
DOI: 10.1214/aos/1176325381. [383]
McLeod, A.I., Yu, H., and Mahdi, E. (2012). Time series analysis with R. In T. Subba Rao
et al. (Eds.) Handbook of Statistics 30: Time Series Analysis: Methods and Applications.
Elsevier, Amsterdam, pp. 661–712. [24]
McQuarrie, A.D.R., Shumway, R., and Tsai, C.-L. (1997). The model selection criterion
AICu. Statistics & Probability Letters, 34(3), 285–292.
DOI: 10.1016/s0167-7152(96)00192-7. [230]
McQuarrie, A.D.R. and Tsai, C.-L. (1998). Regression and Time Series Model Selection.
World Scientific, Singapore. DOI: 10.1142/9789812385451. [230, 231]
Medeiros, M.C., Teräsvirta, T., and Rech, G. (2006). Building neural network models for
time series: A statistical approach. Journal of Forecasting, 25(1), 49–75.
DOI: 10.1002/for.974. [188]
Medeiros, M.C. and Veiga, A. (2002). A hybrid linear-neural model for time series forecast-
ing. IEEE Transactions on Neural Networks, 11(6), 1402–1412.
DOI: 10.1109/72.883463. [75]
Medeiros, M.C. and Veiga, A. (2003). Diagnostic checking in a flexible nonlinear time series
model. Journal of Time Series Analysis, 24(4), 461–482.
DOI: 10.1111/1467-9892.00316. [75]
Medeiros, M.C. and Veiga, A. (2005). A flexible coefficient smooth transition time series
model. IEEE Transactions on Neural Networks, 16(1), 97–113.
DOI: 10.1109/tnn.2004.836246. [75, 188, 247]
Medeiros, M.C., Veiga, A., and Resende, M.G.C. (2002). A combinatorial approach to piece-
wise linear time series analysis. Journal of Computational and Graphical Statistics, 11(1),
236–258. DOI: 10.1198/106186002317375712. [73]
Meitz, M. and Saikkonen, P. (2010). A note on the geometric ergodicity of a nonlinear AR-
ARCH model. Statistics & Probability Letters, 80(7-8), 631–638.
DOI: 10.1016/j.spl.2009.12.020. [91, 111]
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. (1953).
Equation of state calculations by fast computing machines. Journal of Chemical Physics,
21(6), 1087–1091. DOI: 10.1063/1.1699114. [249]
Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability. Springer-
Verlag, New York. (Freely available at: http://probability.ca/MT/BOOK.pdf). Second
edn. (2009), Cambridge University Press, Cambridge. [111]
Milas, C., Rothman, P.A., Van Dijk, D., and Wildasin, D.E. (Eds.) (2006). Nonlinear Time
Series Analysis of Business Cycles. Elsevier, Amsterdam, The Netherlands. [597]
Mira, S. and Escribano, A. (2006). Nonlinear time series models: Consistency and asymp-
totic normality of NLS under new conditions. In W.A. Barnett et al. (Eds.), Nonlinear
Econometric Modeling in Time Series. Cambridge University Press, Cambridge, UK, pp.
119–164. [247]
Moeanaddin, R. and Tong, H. (1988). A comparison of likelihood ratio test and CUSUM test
for threshold autoregression. The Statistician, 37(2), 213–225. Addendum & Corrigendum
37(4/5), p. 473. DOI: 10.2307/2348695 and DOI: 10.2307/2348773. [193]
Mohler, R.R. (Ed.) (1987). Nonlinear Time Series and Signal Processing. Springer-Verlag,
Berlin. [597]
Montgomery, A.L., Zarnowitz, V., Tsay, R.S., and Tiao, G.C. (1998). Forecasting the U.S.
unemployment rate. Journal of the American Statistical Association, 93(442), 478–493.
DOI: 10.1080/01621459.1998.10473696. [23, 276]
Moon, Y.-I., Lall, U., and Kwon, H.-H. (2008). Non-parametric short-term forecasts of the
Great Salt Lake using atmospheric indices. International Journal of Climatology, 28(3),
361–370. DOI: 10.1002/joc.1533. [387]
Moran, P.A.P. (1953). The statistical analysis of the Canadian lynx cycle. Australian Journal
of Zoology, 1(2), 163–173. [293]
Mudholkar, G.S., Marchetti, C.E., and Lin, C.T. (2002). Independence characterizations and
testing normality against restricted skewness-kurtosis alternatives. Journal of Statistical
Planning and Inference, 104(2), 485–501. DOI: 10.1016/s0378-3758(01)00253-1. [23]
Nadaraya, E.A. (1964). On estimating regression. Theory of Probability and its Applications,
9(1), 141–142. [302]
Nelsen, R.B. (2006). An Introduction to Copulas (2nd edn.). Springer-Verlag, New York.
DOI: 10.1007/0-387-28678-0. [305, 306]
Newey, W.K. and West, K.D. (1987). A simple, positive semi-definite, heteroskedasticity
and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703–708.
DOI: 10.2307/1913610. [431, 516, 517]
Nichols, J.M., Olson, C.C., Michalowicz, J.V., and Bucholtz, F. (2009). The bispectrum and
bicoherence for quadratically nonlinear systems subject to non-Gaussian inputs. IEEE
Transactions on Signal Processing, 57(10), 3879–3890.
DOI: 10.1109/tsp.2009.2024267. [150]
Nicholls, D.F. and Quinn, B.G. (1981). The estimation of multivariate random coefficient
autoregressive models. Journal of Multivariate Analysis, 11(4), 544–555.
DOI: 10.1016/0047-259x(81)90095-6. [455, 485]
Nicholls, D.F. and Quinn, B.G. (1982). Random Coefficient Autoregressive Models: An In-
troduction. Springer-Verlag, New York.
DOI: 10.1007/978-1-4684-6273-9. [73, 90, 455, 485, 597]
Nielsen, H.A. and Madsen, H. (2001). A generalization of some classical time series tools.
Computational Statistics & Data Analysis, 37(1), 13–31.
DOI: 10.1016/s0167-9473(00)00061-x. [23]
Niglio, M. (2007). Multi-step forecasts from threshold ARMA models using asymmetric loss
functions. Statistical Methods & Applications, 16(3), 395–410.
DOI: 10.1007/s10260-007-0044-x. [429]
Niglio, M. and Vitale, C.D. (2010a). Local unit roots and global stationarity of TARMA
models. Methodology and Computing in Applied Probability, 14(1), 17–34.
DOI: 10.1007/s11009-010-9166-y. [100]
Niglio, M. and Vitale, C.D. (2010b). Generalization of some linear time series property to
nonlinear domain. In C. Perna and M. Sibillo (Eds.), Mathematical and Statistical Methods
for Actuarial Sciences and Finance. Springer-Verlag, New York, pp. 323–331.
DOI: 10.1007/978-88-470-2342-0_38. [102]
Niglio, M. and Vitale, C.D. (2013). Vector threshold moving average models: Model spe-
cification and invertibility. In N. Torelli et al. (Eds.) Advances in Theoretical and Applied
Statistics. Springer-Verlag, New York, pp. 87–98. DOI: 10.1007/978-3-642-35588-2_9. [448]
Niglio, M. and Vitale, C.D. (2015). Threshold vector ARMA models. Communications in
Statistics: Theory and Methods, 44(14), 2911–2923.
DOI: 10.1080/03610926.2013.814785. [448]
Nørgaard, M., Ravn, O., Poulsen, N.K., and Hansen, L.K. (2000). Neural Networks for
Modelling and Control of Dynamic Systems. Springer-Verlag, New York.
DOI: 10.1007/978-1-4471-0453-7. [74]
Norman, S. (2008). Systematic small sample bias in two regime SETAR model estimation.
Economics Letters, 99(1), 134–138. DOI: 10.1016/j.econlet.2007.06.013. [246]
Öhrvik, J. and Schoier, G. (2005). SETAR model selection – A bootstrap approach. Com-
putational Statistics, 20(4), 559–573. DOI: 10.1007/bf02741315. [249]
Oja, H. (1983). Descriptive statistics for multivariate distributions. Statistics & Probability
Letters, 1(6), 327–332. DOI: 10.1016/0167-7152(83)90054-8. [496]
Ozaki, T. (1982). The statistical analysis of perturbed limit cycle processes using nonlinear
time series models. Journal of Time Series Analysis, 3(1), 29–41.
DOI: 10.1111/j.1467-9892.1982.tb00328.x. [293]
Ozaki, T. and Oda, H. (1978). Non-linear time series models with identification by Akaike’s
information criterion. In D. Dubuisson (Ed.) Information and Systems. Pergamon, Oxford,
pp. 83–91. [73]
Pan, L. and Politis, D.N. (2016). Bootstrap prediction intervals for linear, nonlinear and
nonparametric autoregressions (with discussion). Journal of Statistical Planning and In-
ference, 177, 1–27. DOI: 10.1016/j.jspi.2014.10.003. [410]
Paparoditis, E. and Politis, D.N. (2001). A Markovian local resampling scheme for nonpara-
metric estimators in time series analysis. Econometric Theory, 17(3), 540–566.
DOI: 10.1017/s0266466601173020. [356]
Paparoditis, E. and Politis, D.N. (2002). The local bootstrap for Markov processes. Journal
of Statistical Planning and Inference, 108(1-2), 301–328.
DOI: 10.1016/s0378-3758(02)00315-4. [325, 326, 356]
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of
Mathematical Statistics, 33(3), 1065–1076. DOI: 10.1214/aoms/1177704472. [305]
Patterson, D.M. and Ashley, R.A. (2000). A Nonlinear Time Series Workshop. Kluwer
Academic Publishers, Norwell, MA. DOI: 10.1007/978-1-4419-8688-7. [150, 151, 597]
Péguin–Feissolle, A., Strikholm, B., and Teräsvirta, T. (2013). Testing the Granger non-
causality hypothesis in stationary nonlinear models of unknown functional form. Com-
munications in Statistics: Simulation and Computation, 42(5), 1063–1087.
DOI: 10.1080/03610918.2012.661500. [523]
Pemberton, J. (1987). Exact least squares multi-step prediction from non-linear autoregress-
ive models. Journal of Time Series Analysis, 8(4), 443–448.
DOI: 10.1111/j.1467-9892.1987.tb00007.x. [393, 428]
Perera, S. (2004). Maximum quasi-likelihood estimation for the NEAR(2) model. Journal of
Time Series Analysis, 25(5), 723–732. DOI: 10.1111/j.1467-9892.2004.01886.x. [74]
Pesaran, M.H. and Potter, S.M. (1997). A floor and ceiling model of US output. Journal of
Economic Dynamics & Control, 21(4-5), 661–695.
DOI: 10.1016/s0165-1889(96)00002-4. [79]
Pesaran, M.H. and Shin, Y. (1998). Generalized impulse response analysis in linear mul-
tivariate models. Economics Letters, 58(1), 17–29.
DOI: 10.1016/s0165-1765(97)00214-0. [490]
Pesaran, M.H. and Timmermann, A.G. (1992). A simple nonparametric test of predictive
performance. Journal of Business & Economic Statistics, 10(4), 461–465.
DOI: 10.2307/1391822. [429, 431]
Petruccelli, J.D. (1986). On the consistency of least squares estimators for a threshold AR(1)
model. Journal of Time Series Analysis, 7(4), 269–278.
DOI: 10.1111/j.1467-9892.1986.tb00494.x. [247]
Petruccelli, J.D. (1990). A comparison of tests for SETAR-type non-linearity in time series.
Journal of Forecasting, 9(1), 25–36. DOI: 10.1002/for.3980090104. [189, 193]
Petruccelli, J.D. and Davies, N. (1986). A portmanteau test for self-exciting threshold
autoregressive-type nonlinearity in time series. Biometrika, 73(3), 687–694.
DOI: 10.1093/biomet/73.3.687. [183, 189, 193]
Petruccelli, J.D. and Woolford, S.W. (1984). A threshold AR(1) model. Journal of Applied
Probability, 21(2), 270–286. DOI: 10.2307/3213639. [100]
Pham, D.T. (1986). The mixing property of bilinear and generalised random coefficient
autoregressive models. Stochastic Processes and their Applications, 23(2), 291–300.
DOI: 10.1016/0304-4149(86)90042-6. [90, 111]
Pham, D.T., Chan, K.S., and Tong H. (1991). Strong consistency of the least squares estim-
ator for a non-ergodic threshold autoregressive model. Statistica Sinica, 1(2), 361–369.
[247]
Pham, D.T. and Tran, L.T. (1981). On the first-order bilinear time series model. Journal of
Applied Probability, 18(3), 617–627. DOI: 10.2307/3213316. [103]
Pham, D.T. and Tran, L.T. (1985). Some mixing properties of time series models. Stochastic
Processes and their Applications, 19(2), 297–303.
DOI: 10.1016/0304-4149(85)90031-6. [111]
Pinsker, M.S. (1964). Information and Information Stability of Random Variables and Pro-
cesses. Holden-Day, San Francisco. [18]
Pinson, P., McSharry, P., and Madsen, H. (2010). Reliability diagrams for nonparametric
density forecasts of continuous variables: Accounting for serial correlation. Quarterly
Journal of the Royal Meteorological Society, 136(646), 77–90, Part A.
DOI: 10.1002/qj.559. [430]
Pippenger, M.K. and Goering, G.E. (2000). Additional results on the power of unit root and
cointegration tests under threshold processes. Applied Economics Letters, 7(10), 641–644.
DOI: 10.1080/135048500415932. [486]
Pitarakis, J.-Y. (2006). Model selection uncertainty and detection of threshold effects. Stud-
ies in Nonlinear Dynamics & Econometrics, 10(1), 1–30.
DOI: 10.2202/1558-3708.1256. [187]
Pitarakis, J.-Y. (2008). Comments on: Threshold autoregression with a unit root. Econo-
metrica, 76(5), 1207–1217. DOI: 10.3982/ECTA6979. [189]
Polonik, W. and Yao, Q. (2000). Conditional minimum volume predictive regions for
stochastic processes. Journal of the American Statistical Association, 95(450), 509–519.
DOI: 10.2307/2669395. [384, 413, 429]
Politis, D.N. (2013). Model-free model-fitting and predictive distributions. Test, 22(2), 183–
250 (with discussion). DOI: 10.1007/s11749-013-0317-7. [412]
Politis, D.N. (2015). Model-Free Prediction and Regression. Springer-Verlag, New York.
DOI: 10.1007/978-3-319-21347-7. [412]
Politis, D.N. and Romano, J.P. (1992). A circular block-resampling procedure for stationary
data. In R. LePage and L. Billard (Eds.) Exploring the Limits of Bootstrap. Wiley, New
York, pp. 263–270. [329]
Politis, D.N. and Romano, J.P. (1994). The stationary bootstrap. Journal of the American
Statistical Association, 89(428), 1303–1313. DOI: 10.2307/2290993. [321]
Porcher, R. and Thomas, G. (2003). Order determination in nonlinear time series by pen-
alized least-squares. Communications in Statistics: Simulation and Computation, 32(4),
1115–1129. DOI: 10.1081/sac-120023881. [383]
Potter, S.M. (2000). Nonlinear impulse response functions. Journal of Economic Dynamics
& Control, 24(10), 1425–1446. DOI: 10.1016/s0165-1889(99)00013-5. [77]
Pourahmadi, M. (1988). Stationarity of the solution of $X_t = A_t X_{t-1} + \varepsilon_t$ and analysis of
non-Gaussian dependent random variables. Journal of Time Series Analysis, 9(3), 225–239.
DOI: 10.1111/j.1467-9892.1988.tb00467.x. [110]
Priestley, M.B. (1981). Spectral Analysis and Time Series: Vol. 1. Academic Press, New
York. [1]
Priestley, M.B. (1988). Non-linear and Non-stationary Time Series Analysis. Academic
Press, New York. [73, 149, 597]
Priestley, M.B. and Gabr, M.M. (1993). Bispectral analysis of non-stationary processes. In
C.R. Rao (Ed.) Multivariate Analysis: Future Directions. North-Holland, Amsterdam,
Chapter 16, pp. 295–317. [149, 150]
Psaradakis, Z., Sola, M., Spagnolo, F., and Spagnolo, N. (2009). Selecting nonlinear time
series models using information criteria. Journal of Time Series Analysis, 30(4), 369–394.
DOI: 10.1111/j.1467-9892.2009.00614.x. [249]
Puchstein, R. and Preuß, P. (2016). Testing for stationarity in multivariate locally stationary
processes. Journal of Time Series Analysis, 37(1), 3–29. DOI: 10.1111/jtsa.12133. [522]
Qi, M. and Zhang, G.P. (2001). An investigation of model selection criteria for neural network
time series forecasting. European Journal of Operational Research, 132(3), 666–680.
DOI: 10.1016/s0377-2217(00)00171-5. [249]
Quade, D. (1967). Rank analysis of covariance. Journal of the American Statistical Associ-
ation, 62(320), 1187–1200. DOI: 10.1080/01621459.1967.10500925. [17]
Quinn, B.G. (1982). Stationarity and invertibility of simple bilinear models. Stochastic Pro-
cesses and their Applications, 12(2), 225–230. DOI: 10.1016/0304-4149(82)90045-x. [103]
Rabemananjara, R. and Zakoïan, J.-M. (1993). Threshold ARCH models and asymmetries
in volatility. Journal of Applied Econometrics, 8(1), 31–49.
DOI: 10.1002/jae.3950080104. [81]
Racine, J.S. and Maasoumi, E. (2007). A versatile and robust metric entropy test of time-
reversibility, and other hypotheses. Journal of Econometrics, 138(2), 547–567.
DOI: 10.1016/j.jeconom.2006.05.009. [333]
Raftery, A.E. (1982). Generalized non-normal time series models. In O.D. Anderson (Ed.)
Time Series Analysis: Theory and Practice 1. North-Holland, Amsterdam, pp. 621–640.
[74]
Ramsey, J.B. (1969). Tests for specification errors in classical linear least squares regression
analysis. Journal of the Royal Statistical Society, B 31(2), 350–371. [189]
Ramsey, J.B. and Rothman, P. (1996). Time irreversibility and business cycle asymmetry.
Journal of Money, Credit and Banking, 28(1), 1–21. DOI: 10.2307/2077963. [317, 318]
Rao, C.R. (1973). Linear Statistical Inference and Its Applications (2nd edn.). Wiley, New
York. DOI: 10.1002/9780470316436. [459]
Rao Jammalamadaka, S., Subba Rao, T., and Terdik, G. (2006). Higher order cumulants
of random vectors and applications to statistical inference and time series. Sankhya:
The Indian Journal of Statistics, A 68(2), 326–356. Available at: http://eprints.
ma.man.ac.uk/188/, and https://www.researchgate.net/publication/266584530_
Higher_order_statistics_and_multivariate_vector_Hermite_polynomials_for_
nonlinear_analysis_of_multidimensional_time_series. [522]
Rapach, D.E. and Wohar, M.E. (2006). The out-of-sample forecasting performance of non-
linear models of real exchange rate behaviour. International Journal of Forecasting, 22(2),
341–361. DOI: 10.1016/j.ijforecast.2005.09.006. [430]
Rech, G., Teräsvirta, T., and Tschernig, R. (2001). A simple variable selection technique for
nonlinear models. Communications in Statistics: Theory and Methods, 30(6), 1227–1241.
DOI: 10.1081/sta-100104360. [487, 522]
Resnick, S. and Van den Berg, E. (2000a). Sample correlation behavior for the heavy tailed
general bilinear process. Communication in Statistics: Stochastic Models, 16(2), 233–258.
DOI: 10.1080/15326340008807586. [23]
Resnick, S. and Van den Berg, E. (2000b). A test for nonlinearity of time series with infinite
variance. Extremes, 3(2), 145–172. DOI: 10.1023/A:1009996916066. [23]
Rinke, S. and Sibbertsen, P. (2016). Information criteria for nonlinear time series models.
Studies in Nonlinear Dynamics & Econometrics, 20(3), 325–341.
DOI: 10.1515/snde-2015-0026. [249]
Rio, E. (1993). Covariance inequalities for strongly mixing processes. Annales de l'Institut
Henri Poincaré, Probabilités et Statistiques, Section B, 29(4), 587–597. [96]
Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics, 14(3),
1080–1100. DOI: 10.1214/aos/1176350051. [232]
Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods (2nd edn.). Springer-
Verlag, New York. DOI: 10.1007/978-1-4757-4145-2. [233, 249]
Robinson, P.M. (1977). The estimation of a nonlinear moving average model. Stochastic
Processes and their Applications, 5(1), 81–89. DOI: 10.1016/0304-4149(77)90052-7. [73]
Robinson, P.M. (1983). Nonparametric estimators for time series. Journal of Time Series
Analysis, 4(3), 185–207. DOI: 10.1111/j.1467-9892.1983.tb00368.x. [356]
Robinzonov, N., Tutz, G., and Hothorn, T. (2012). Boosting techniques for nonlinear time
series models. Advances in Statistical Analysis, 96(1), 99–122.
DOI: 10.1007/s10182-011-0163-4. [381, 383]
Rota, G.C. (1964). On the Foundations of Combinatorial Theory. I. Theory of Möbius Func-
tions. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 2(4), 340–368.
DOI: 10.1007/bf00531932. [285]
Rothman, P. (1992). The comparative power of the TR test against simple threshold models.
Journal of Applied Econometrics, 7(S1), 187–195. DOI: 10.1002/jae.3950070513. [333]
Rothman, P. (1996). FORTRAN programs for running the TR test: A guide and examples.
Studies in Nonlinear Dynamics & Econometrics, 1(4). DOI: 10.2202/1558-3708.1023. [333]
Rothman, P. (Ed.) (1999). Nonlinear Time Series Analysis of Economic and Financial Data.
Springer Science+Business Media, New York. DOI: 10.1007/978-1-4615-5129-4. [597]
Rusticelli, E., Ashley, R.A., Dagum, E.B., and Patterson, D.M. (2009). A new bispectral
test for nonlinear serial dependence. Econometric Reviews, 28(1-3), 279–293.
DOI: 10.1080/07474930802388090. [136, 147]
Saikkonen, P. (2005). Stability results for nonlinear error correction models. Journal of
Econometrics, 127(1), 69–81. DOI: 10.1016/j.jeconom.2004.03.001. [455]
Saikkonen, P. (2008). Stability of regime switching error correction models under linear
cointegration. Econometric Theory, 24(1), 294–318.
DOI: 10.1017/s0266466608080122. [296, 455]
Saikkonen, P. and Luukkonen, R. (1988). Lagrange multiplier tests for testing non-linearities
in time series models. Scandinavian Journal of Statistics, 15(1), 55–68. [158, 188, 193]
Saikkonen, P. and Luukkonen, R. (1991). Power properties of a time series linearity test
against some simple bilinear alternatives. Statistica Sinica, 1(2), 453–464. [193]
Sakaguchi, F. (1991). A relation for ‘linearity’ of the bispectrum. Journal of Time Series
Analysis, 12(3), 267–272. DOI: 10.1111/j.1467-9892.1991.tb00082.x. [152]
Sakamoto, W. (2007). MARS: Selecting basis and knots with the empirical Bayes method.
Computational Statistics, 22(4), 583–597. DOI: 10.1007/s00180-007-0075-7. [383]
Samia, N.I., Chan, K.S., and Stenseth, N.C. (2007). A generalized threshold mixed model
for analyzing nonnormal nonlinear time series, with application to plague in Kazakhstan.
Biometrika, 94(1), 101–118. DOI: 10.1093/biomet/asm006. [79]
Samworth, R.J. and Wand, M.P. (2010). Asymptotics and optimal bandwidth selection for
highest density region estimation. The Annals of Statistics, 38(3), 1767–1792.
DOI: 10.1214/09-aos766. [429]
Schleer–van Gellecom, F. (2015). Finding starting-values for the estimation of vector STAR
models. Econometrics, 3(1), 65–90. DOI: 10.3390/econometrics3010065. [486]
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2),
461–464. DOI: 10.1214/aos/1176344136. [231]
Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization.
Wiley, New York (2nd edn., 2015). DOI: 10.1002/9781118575574. [521, 525]
Seo, M.H. (2006). Bootstrap testing for the null of no cointegration in a threshold vector
error correction model. Journal of Econometrics, 134(1), 129–150.
DOI: 10.1016/j.jeconom.2005.06.018. [486]
Seo, M.H. (2008). Unit root test in a threshold autoregression: Asymptotic theory and
residual-based bootstrap. Econometric Theory, 24(6), 1699–1716.
DOI: 10.1017/s0266466608080663. [189]
Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
DOI: 10.1002/9780470316481. [308]
Serfling, R.J. (2002). Quantile functions for multivariate analysis: Approaches and applica-
tions. Statistica Neerlandica, 56(2), 214–232. DOI: 10.1111/1467-9574.00195. [521]
Sesay, S.A.O. and Subba Rao, T. (1992). Frequency-domain estimation of bilinear time series
models. Journal of Time Series Analysis, 13(6), 521–545.
DOI: 10.1111/j.1467-9892.1992.tb00124.x. [486]
Shafik, N. and Tutz, G. (2009). Boosting nonlinear additive autoregressive time series. Com-
putational Statistics & Data Analysis, 53(7), 2453–2464.
DOI: 10.1016/j.csda.2008.12.006. [381, 383]
Sharifdoost, M., Mahmoodi, S., and Pasha, E. (2009). A statistical test for time reversibility
of stationary finite state Markov chains. Applied Mathematical Sciences, 3(52), 2563–2574.
Available at: http://www.m-hikari.com/ams/ams-password-2009/ams-password49-
52-2009/lotfiAMS49-52-2009-4.pdf. [333]
Shorack, G.R. and Wellner, J.A. (1984). Empirical Processes with Applications to Statistics.
Wiley, New York. DOI: 10.1137/1.9780898719017. [285]
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman &
Hall, London. DOI: 10.1007/978-1-4899-3324-9. [268, 270, 305, 358]
Simonoff, J.S. and Tsai, C.-L. (1999). Semiparametric and additive model selection using an
improved Akaike information criterion. Journal of Computational and Graphical Statistics,
8(1), 22–40. DOI: 10.2307/1390918. [249]
Singh, R.S. and Ullah, A. (1985). Nonparametric time-series estimation of joint DGP, con-
ditional DGP and vector autoregression. Econometric Theory, 1(1), 27–52.
DOI: 10.1017/s0266466600010987. [356]
Skaug, H.J. and Tjøstheim, D. (1993a). Nonparametric tests of serial dependence. In T.
Subba Rao (Ed.) Developments in Time Series Analysis. Chapman & Hall, London, pp.
207–229. [264, 271, 297, 311]
Skaug, H.J. and Tjøstheim, D. (1993b). A nonparametric test of serial independence based
on the empirical distribution function. Biometrika, 80(3), 591–602.
DOI: 10.1093/biomet/80.3.591. [271, 272, 297]
Skaug, H.J. and Tjøstheim, D. (1996). Measures of distance between densities with applica-
tion to testing for serial independence. In P.M. Robinson and M. Rosenblatt (Eds.) Time
Series Analysis in Memory of E.J. Hannan. Springer-Verlag, New York, pp. 363–377. [271,
297]
Smith, J. and Wallis, K.F. (2009). A simple explanation of the forecast combination puzzle.
Oxford Bulletin of Economics and Statistics, 71(3), 331–355.
DOI: 10.1111/j.1468-0084.2008.00541.x. [425]
Smith, R.L. (1986). Maximum likelihood estimation for the NEAR(2) model. Journal of the
Royal Statistical Society, B 48(2), 251–257. [74]
So, M.P., Li, W.K., and Lam, K. (2002). A threshold stochastic volatility model. Journal of
Forecasting, 21(7), 473–500. DOI: 10.1002/for.840. [81]
Solari, S. and Van Gelder, P.H.A.J.M. (2011). On the use of vector autoregress-
ive (VAR) and regime switching VAR models for the simulation of sea and wind
state parameters. In C.G. Soares et al. (Eds.), Marine Technology and Engin-
eering, Volume 1. Taylor & Francis Group, London, pp. 217–230. Available at:
http://www.tbm.tudelft.nl/fileadmin/Faculteit/CiTG/Over_de_faculteit/Afdelingen/
Afdeling_Waterbouwkunde/sectie_waterbouwkunde/people/personal/gelder/
publications/papers/doc/solari_015.pdf. [486]
Sorour, A. and Tong, H. (1993). A note on tests for threshold-type non-linearity in open
loop systems. Applied Statistics, 42(1), 95–104. DOI: 10.2307/2347412. [189]
Stam, C.J. (2005). Nonlinear dynamical analysis of EEG and MEG: Review of an emerging
field. Clinical Neurophysiology, 116, 2266–2301. [23]
Steinberg, I.Z. (1986). On the time reversal of noise signals. Biophysical Journal, 50(1),
171–179. DOI: 10.1016/s0006-3495(86)83449-x. [317]
Stenseth, N.C., Chan, K.S., Tavecchia, G., Coulson, T., Mysterud, A., Clutton-Brock, T.,
and Grenfell, B. (2004). Modelling non-additive and nonlinear signals from climatic noise
in ecological time series: Soay sheep as an example. Proceedings of The Royal Society
London, B 271(1552), 1985–1993. DOI: 10.1098/rspb.2004.2794. [73]
Stenseth, N.C., Falck, W., Bjørnstad, O.N., and Krebs, C.J. (1997). Population regulation
in snowshoe hare and Canadian lynx: Asymmetric food web configurations between hare
and lynx. Proceedings of the National Academy of Sciences USA, 94(10), 5147–5152.
DOI: 10.1073/pnas.94.10.5147. [293]
Stensholt, B.K. and Tjøstheim, D. (1987). Multiple bilinear time series models. Journal of
Time Series Analysis, 8(2), 221–233.
DOI: 10.1111/j.1467-9892.1987.tb00434.x. [441, 442, 443]
Stephens, M.A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of
the American Statistical Association, 69(347), 730–737.
DOI: 10.2307/2286009 and DOI: 10.1080/01621459.1974.10480196. [134]
Stephens, M.A. (1986). Tests based on EDF statistics. In R.B. D’Agostino and M.A. Steph-
ens (Eds.) Goodness-of-Fit Techniques. Marcel Dekker, New York, pp. 97–193. [135]
Steuber, T.L., Kiessler, P.C., and Lund, R. (2012). Testing for reversibility in Markov chain
data. Probability in the Engineering and Informational Sciences, 26(4), 593–611.
DOI: 10.1017/s0269964812000228. [333]
Stoica, P., Eykhoff, P., Janssen, P., and Söderström, T. (1986). Model-structure selection
by cross-validation. International Journal of Control, 43(6), 1841–1878.
DOI: 10.1080/00207178608933575. [234]
Stone, C.J. (1977). Consistent nonparametric regression. The Annals of Statistics, 5(4),
595–645. DOI: 10.1214/aos/1176343886. [354]
Strikholm, B. and Teräsvirta, T. (2006). A sequential procedure for determining the number
of regimes in a threshold autoregressive model. Econometrics Journal, 9(3), 472–491.
DOI: 10.1111/j.1368-423x.2006.00194.x. [249]
Su, L. and White, H. (2008). A nonparametric Hellinger metric test for conditional inde-
pendence. Econometric Theory, 24(4), 829–864. DOI: 10.1017/s0266466608080341. [294]
Suárez–Fariñas, M., Pedreira, C.E., and Medeiros, M.C. (2004). Local global neural net-
works: A new approach for nonlinear time series modeling. Journal of the American
Statistical Association, 99(468), 1092–1107.
DOI: 10.1198/016214504000001691. [64, 75, 247]
Subba Rao, T. (1981). On the theory of bilinear time series models. Journal of the Royal
Statistical Society, B 43(2), 244–255. [103]
Subba Rao, T. and Gabr, M.M. (1980). A test for linearity of stationary time series. Journal
of Time Series Analysis, 1(2), 145–158.
DOI: 10.1111/j.1467-9892.1980.tb00308.x. [119, 126, 128]
Subba Rao, T. and Gabr, M.M. (1984). An Introduction to Bispectral Analysis and Bilinear
Time Series Models. Springer-Verlag, New York.
DOI: 10.1007/978-1-4684-6318-7. [73, 117, 126, 129, 150, 151, 486, 597]
Subba Rao, T. and Terdik, G. (2003). On the theory of discrete and continuous bilinear time
series models. In D.N. Shanbhag and C.R. Rao (Eds.) Stochastic Processes: Modelling and
Simulation, Handbook of Statistics, Vol. 21. North-Holland, Amsterdam, pp. 827–870.
DOI: 10.1016/s0169-7161(03)21023-3. [485]
Subba Rao, T. and Wong, W.K. (1998). Tests for Gaussianity and linearity of multivariate
stationary time series. Journal of Statistical Planning and Inference, 68(2), 373–386.
DOI: 10.1016/s0378-3758(97)00150-x. [522]
Subba Rao, T. and Wong, W.K. (1999). Some contributions to multivariate nonlinear time
series bilinear models. In S. Ghosh (Ed.) Asymptotics, Nonparametrics and Time Series.
Marcel Dekker, New York, pp. 259–294. [486, 511]
Swanson, N.R. and White, H. (1997a). Forecasting economic time series using flexible versus
fixed specification and linear versus nonlinear econometric models. International Journal
of Forecasting, 13(4), 439–461. DOI: 10.1016/s0169-2070(97)00030-7. [429]
Swanson, N.R. and White, H. (1997b). A model selection approach to real-time macroe-
conomic forecasting using linear models and artificial neural networks. The Review of
Economics and Statistics, 79(4), 540–550. DOI: 10.1162/003465397557123. [429]
Székely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007). Measuring and testing dependence by
correlation of distances. The Annals of Statistics, 35(6), 2760–2794.
DOI: 10.1214/009053607000000505. [296]
Tay, A.S. and Wallis, K.F. (2000). Density forecasting: A survey. Journal of Forecasting,
19(4), 235–254. DOI: 10.1002/1099-131X(200007). Reprinted in M.P. Clements and D.F.
Hendry (Eds.), A Companion to Economic Forecasting. Blackwells, Oxford (2002), pp.
45–68. [430]
Teles, P. and Wei, W.W.S. (2000). The effects of temporal aggregation on tests of linearity
of a time series. Computational Statistics & Data Analysis, 34(1), 91–103.
DOI: 10.1016/s0167-9473(99)00072-9. [151]
Teräsvirta, T., Lin, C.-F., and Granger, C.W.J. (1993). Power of the neural network linearity
test. Journal of Time Series Analysis, 14(2), 209–220.
DOI: 10.1111/j.1467-9892.1993.tb00139.x. [188, 190]
Teräsvirta, T., Tjøstheim, D., and Granger, C.W.J. (2010). Modelling Nonlinear Economic
Time Series. Oxford University Press, New York.
DOI: 10.1093/acprof:oso/9780199587148.001.0001. [201, 597]
Teräsvirta, T. and Yang, Y. (2014a). Linearity and misspecification tests for vector smooth
transition regression models. CORE Discussion paper 2014/62. Available at: http:
//www.uclouvain.be/cps/ucl/doc/core/documents/coredp2014_62web.pdf. Also
available as CREATES Research Paper 2014-04, Aarhus University. [469, 470]
Terdik, G. (1999). Bilinear Stochastic Models and Related Problems of Nonlinear Time Series
Analysis. Lecture Notes in Statistics 142. Springer-Verlag, New York.
DOI: 10.1007/978-1-4612-1552-3. (Freely available at: http://dragon.unideb.hu/
~terdik/PostScr/TerdikGyLNS142.pdf). [23, 115, 140, 146, 273, 597]
Terdik, G., Gál, Z., Iglói, E., and Molnár, S. (2002). Bispectral analysis of traffic in high-
speed networks. Computers & Mathematics with Applications, 43(12), 1575–1583.
DOI: 10.1016/s0898-1221(02)00120-7. [146]
Terdik, G. and Máth, J. (1993). Bispectrum based checking of linear predictability for time
series. In T. Subba Rao (Ed.) Developments in Time Series Analysis. Chapman & Hall,
London, pp. 274–282. DOI: 10.1007/978-1-4899-4515-0_19. [141, 146]
Terdik, G. and Máth, J. (1998). A new test of linearity of time series based on the bispectrum.
Journal of Time Series Analysis, 19(6), 737–753.
DOI: 10.1111/1467-9892.00120. [140, 142, 143, 146, 147]
Thavaneswaran, A. and Abraham, B. (1988). Estimation for non-linear time series models
using estimating equations. Journal of Time Series Analysis, 9(1), 99–108.
DOI: 10.1111/j.1467-9892.1988.tb00457.x. [248]
Theiler, J., Eubank, S., Longtin, A., Galdrikian, B., and Farmer, J.D. (1992). Testing for
nonlinearity in time series: The method of surrogate data. Physica, D 58(1-4), 77–94.
DOI: 10.1016/0167-2789(92)90102-s. [150, 188]
Theiler, J. and Prichard, D. (1996). Constrained Monte-Carlo method for hypothesis testing.
Physica, D 94(4), 221–235. DOI: 10.1016/0167-2789(96)00050-4. [188]
Tiao, G.C. and Tsay, R.S. (1994). Some advances in nonlinear and adaptive modeling in
time series analysis. Journal of Forecasting, 13(2), 109–131.
DOI: 10.1002/for.3980130206. [46]
Tibshirani, R. (1988). Estimating transformations for regression via additivity and variance
stabilization. Journal of the American Statistical Association, 83(402), 194–205.
DOI: 10.2307/2288855. [383]
Tjøstheim, D. (1986a). Some doubly stochastic time series models. Journal of Time Series
Analysis, 7(1), 51–72. DOI: 10.1111/j.1467-9892.1986.tb00485.x. [39]
Tjøstheim, D. (1986b). Estimation in nonlinear time series models. Stochastic Processes and
their Applications, 21(2), 251–273. DOI: 10.1016/0304-4149(86)90099-2. [39, 199, 472, 473]
Tjøstheim, D. (1990). Non-linear time series and Markov chains. Advances in Applied Prob-
ability, 22(3), 587–611. DOI: 10.2307/1427459. [90]
Tong, H. (1977). Discussion of the paper by A.J. Lawrance and N.T. Kottegoda. Journal of
the Royal Statistical Society, A 140(1), 34–35. DOI: 10.2307/2344516. [73, 78]
Tong, H. (1980). A view on non-linear time series building. In O.D. Anderson (Ed.) Time
Series. North-Holland, Amsterdam, pp. 41–56. [73]
Tong, H. (1990). Non-Linear Time Series: A Dynamical System Approach. Oxford Univer-
sity Press, Oxford. [50, 73, 75, 78, 85, 250, 293, 312, 400, 597]
Tong, H. (2007). Birth of the threshold time series model. Statistica Sinica, 17(1), 8–14. [73]
Tong, H. (2011). Threshold models in time series analysis – 30 years on. Statistics and Its
Interface, 4(2), 107–136 (with discussion). DOI: 10.4310/sii.2011.v4.n2.a1. [73]
Tong, H. (2015). Threshold models in time series analysis – some reflections. Journal of
Econometrics, 189(2), 485–491. DOI: 10.1016/j.jeconom.2015.03.039. [73]
Tong, H. and Lim, K.S. (1980). Threshold autoregression, limit cycles and cyclical data.
Journal of the Royal Statistical Society, B 42(3), 245–292 (with discussion). Also published
in Exploration of a Nonlinear World: An Appreciation of Howell Tong’s Contributions to
Statistics, K.S. Chan (Ed.), World Scientific, Singapore.
DOI: 10.1142/9789812836281_0002. [41, 73]
Tong, H. and Moeanaddin, R. (1988). On multi-step non-linear least squares prediction. The
Statistician, 37(2), 101–110. DOI: 10.2307/2348685. [428]
Tong, H. and Yeung, I. (1990). On tests for threshold-type nonlinearity in irregularly spaced
time series. Journal of Statistical Computation and Simulation, 34(4), 172–194.
DOI: 10.1080/00949659008811226. [189]
Tong, H. and Yeung, I. (1991b). On tests for self-exciting threshold autoregressive-type non-
linearity in partially observed time series. Applied Statistics, 40(1), 43–62.
DOI: 10.2307/2347904. [189]
Tong, H., Thanoon, B., and Gudmundson, G.L. (1985). Threshold time series modeling
of two Icelandic riverflow systems. In K.W. Hipel (Ed.) Time Series Analysis in Water
Resources. American Water Resources Association, 21, pp. 651–661. [85, 481]
Trapletti, A., Leisch, F., and Hornik, K. (2000). Stationary and integrated autoregressive
neural network processes. Neural Computation, 12(10), 2427–2450.
DOI: 10.1162/089976600300015006. [58]
Tsai, H. and Chan, K.S. (2000). Testing for nonlinearity with partially observed time series.
Biometrika, 87(4), 805–821. DOI: 10.1093/biomet/87.4.805. [188, 189]
Tsai, H. and Chan, K.S. (2002). A note on testing for nonlinearity with partially observed
time series. Biometrika, 89(1), 245–250. DOI: 10.1093/biomet/89.1.245. [188, 189]
Tsallis, C. (1998). Generalized entropy-based criterion for consistent testing. Physical Re-
view, E 58(2), 1442–1445. DOI: 10.1103/physreve.58.1442. [264]
Tsay, R.S. (1986). Nonlinearity tests for time series. Biometrika, 73(2), 461–466.
DOI: 10.1093/biomet/73.2.461. [180, 181, 193]
Tsay, R.S. (1989). Testing and modeling threshold autoregressive processes. Journal of the
American Statistical Association, 84(405), 231–240.
DOI: 10.2307/2289868. [182, 184, 185, 193, 293]
Tsay, R.S. (1991). Detecting and modeling nonlinearity in univariate time series analysis.
Statistica Sinica, 1(2), 431–451. [180, 185, 193]
Tsay, R.S. (1998). Testing and modeling multivariate threshold models. Journal of the Amer-
ican Statistical Association, 93(443), 1188–1202.
DOI: 10.2307/2669861. [447, 464, 465, 481, 482, 488, 492]
Tsay, R.S. (2010). Analysis of Financial Time Series (3rd edn.). Wiley, New York.
DOI: 10.1002/0471264105. [488, 501]
Tschernig, R. and Yang, L. (2000). Nonparametric lag selection for time series. Journal of
Time Series Analysis, 21(4), 457–487. DOI: 10.1111/1467-9892.00193. [358, 359, 522]
Tse, Y.K. and Zuo, X.L. (1998). Testing for conditional heteroskedasticity: Some Monte
Carlo results. Journal of Statistical Computation and Simulation, 58(3), 237–253.
DOI: 10.1080/00949659708811833. [236]
Tsolaki, E.P. (2008). Testing nonstationary time series for Gaussianity and linearity using
the evolutionary bispectrum: An application to internet traffic data. Signal Processing,
88(6), 1355–1367. DOI: 10.1016/j.sigpro.2007.12.011. [150]
Tukey, J.W. (1949). One degree of freedom for non-additivity. Biometrics, 5(3), 232–242.
DOI: 10.2307/3001938. [179]
Tutz, G. and Binder, H. (2006). Generalized additive modelling with implicit variable selec-
tion by likelihood based boosting. Biometrics, 62(4), 961–971.
DOI: 10.1111/j.1541-0420.2006.00578.x. [386]
Ubilava, D. (2012). El Niño, La Niña, and world coffee price dynamics. Agricultural Eco-
nomics, 43(1), 17–26. DOI: 10.1111/j.1574-0862.2011.00562.x. [24]
Ubilava, D. and Helmers, C.G. (2013). Forecasting ENSO with a smooth transition autore-
gressive model. Environmental Modelling & Software, 40, 181–190.
DOI: 10.1016/j.envsoft.2012.09.008. [24, 215, 422]
Ullah, A. (1996). Entropy, divergence and distance measures with econometric applications.
Journal of Statistical Planning and Inference, 49(1), 137–162.
DOI: 10.1016/0378-3758(95)00034-8. [295]
Van Casteren, P.H.F.M. and De Gooijer, J.G. (1997). Model selection by maximum entropy.
In T.B. Fomby and R.C. Hill (Eds.), Advances in Econometrics (Applying Maximum
Entropy to Econometric Problems), Vol. 12. JAI Press, Connecticut, pp. 135–161.
DOI: 10.1108/s0731-9053(1997)0000012007. [249]
Van Dijk, D. and Franses, P.H. (2003). Selecting a nonlinear time series model using weighted
tests of equal forecast accuracy. Oxford Bulletin of Economics and Statistics, 65(s1), 727–
744. DOI: 10.1046/j.0305-9049.2003.00091.x. [429]
Van Dijk, D., Teräsvirta, T., and Franses, P.H. (2002). Smooth transition autoregressive
models – A survey of recent developments. Econometric Reviews, 21(1), 1–47.
DOI: 10.1081/etc-120002918. [74]
Van Ness, J.W. (1966). Asymptotic normality of bispectral estimates. Annals of Mathemat-
ical Statistics, 37(5), 1257–1275. DOI: 10.1214/aoms/1177699269. [149]
Vavra, M. (2013). Testing for Non-linearity and Asymmetry in Time Series, Ph.D. thesis,
Birkbeck College, University of London, UK. Available at: http://bbktheses.da.ulcc.
ac.uk/97/1/final%20Marian%20Vavra.pdf. [190]
Vieu, P. (1995). Order choice in nonlinear autoregressive models. Statistics, 26(4), 307–328.
DOI: 10.1080/02331889508802499. [383]
Wallis, K.F. (2003). Chi-square tests of interval and density forecasts, and the Bank of
England’s fan charts. International Journal of Forecasting, 19(2), 165–175.
DOI: 10.1016/s0169-2070(02)00009-2. [430]
Wallis, K.F. (2011). Combining forecasts – forty years later. Applied Financial Economics,
21(1-2), 33–41. DOI: 10.1080/09603107.2011.523179. [430]
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall, London.
DOI: 10.1007/978-1-4899-4493-1. [298, 385]
Wang, H.B. (2008). Nonlinear ARMA models with functional MA coefficients. Journal of
Time Series Analysis, 29(6), 1032–1056. DOI: 10.1111/j.1467-9892.2008.00594.x. [374]
Watson, G.S. (1964). Smooth regression analysis. Sankhyā, A 26, 359–372. [302]
Wecker, W.E. (1981). Asymmetric time series. Journal of the American Statistical Associ-
ation, 76(373), 16–21. Corrigendum: p. 954. DOI: 10.2307/2287034. [74, 116]
Weiss, A.A. (1986). ARCH and bilinear time series models: comparison and combination.
Journal of Business & Economic Statistics, 4(1), 59–70. DOI: 10.2307/1391387. [188]
Welsh, A.K. and Jernigan, R.W. (1983). A statistic to identify asymmetric time series.
American Statistical Association, Proceedings of the Business and Economic Statistics
Section, pp. 390–395. [194]
West, K.D. (1996). Asymptotic inference about predictive ability. Econometrica, 64(5),
1067–1084. DOI: 10.2307/2171956. [427]
West, K.D. (2001). Tests for forecast encompassing when forecasts depend on estimated
regression parameters. Journal of Business & Economic Statistics, 19(1), 29–33.
DOI: 10.1198/07350010152472580. [427]
West, K.D. (2006). Chapter 3: Forecast evaluation. In G. Elliott et al. (Eds.) Handbook of
Economic Forecasting, Volume 1. North-Holland, Amsterdam, pp. 99–134.
DOI: 10.1016/s1574-0706(05)01003-7. [429]
White, H. (1984). Asymptotic Theory for Econometricians. Academic Press, Orlando, Flor-
ida. [320]
White, H. (1989). An additional hidden unit test for neglected non-linearity in multilayer
feedforward networks. In Proceedings of the International Joint Conference on Neural Net-
works, Washington, D.C. (IEEE Press, New York), Vol. I. San Diego, CA: SOS Printing,
pp. 451–455. DOI: 10.1109/ijcnn.1989.118281. [188]
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126.
DOI: 10.1111/1468-0262.00152. [427, 429]
Wiener, N. (1958). Non-linear Problems in Random Theory. Wiley, London. [72, 597]
Wilson, G.T. (1969). Factorization of the covariance generating function of a pure moving
average process. SIAM Journal on Numerical Analysis, 6(1), 1–7.
DOI: 10.1137/0706001. [219]
Wolff, R.C.L. and Robinson, P.M. (1994). Independence in time series: Another look at the
BDS test [and Discussion]. Philosophical Transactions of the Royal Society of London,
A 348(1688), 383–395. DOI: 10.1098/rsta.1994.0098. [296]
Wong, C.-M. and Kohn, R. (1996). A Bayesian approach to estimating and forecasting
additive nonparametric autoregression in time series. Journal of Time Series Analysis,
17(2), 203–220. DOI: 10.1111/j.1467-9892.1996.tb00273.x. [383]
Wong, C.S. and Li, W.K. (1997). Testing for threshold autoregression with conditional
heteroskedasticity. Biometrika, 84(2), 407–418. DOI: 10.1093/biomet/84.2.407. [188]
Wong, C.S. and Li, W.K. (1998). A note on the corrected Akaike information criterion for
threshold autoregressive models. Journal of Time Series Analysis, 19(1), 113–124.
DOI: 10.1111/1467-9892.00080. [249]
Wong, C.S. and Li, W.K. (2000a). Testing for double threshold autoregressive conditional
heteroskedastic model. Statistica Sinica, 10(1), 173–189. [188]
Wong, C.S. and Li, W.K. (2000b). On a mixture autoregressive model. Journal of the Royal
Statistical Society, B 62(1), 95–115. DOI: 10.1111/1467-9868.00222. [240, 296, 313]
Wong, C.S. and Li, W.K. (2001). On a mixture autoregressive conditional heteroscedastic
model. Journal of the American Statistical Association, 96(455), 982–995.
DOI: 10.1198/016214501753208645. [240, 296]
Wong, W.K. (1997). Frequency domain tests of multivariate Gaussianity and linearity.
Journal of Time Series Analysis, 18(2), 181–194.
DOI: 10.1111/1467-9892.00045. [511, 512]
Wu, E.H.C., Yu, P.L.H., and Li, W.K. (2009). A smoothed bootstrap test for independence
based on mutual information. Computational Statistics & Data Analysis, 53(7), 2524–
2536. DOI: 10.1016/j.csda.2008.11.032. [23, 296]
Wu, T.Z., Yu, K., and Yu, Y. (2010). Single-index quantile regression. Journal of Multivari-
ate Analysis, 101(7), 1607–1621. DOI: 10.1016/j.jmva.2010.02.003. [384]
Wu, T.Z., Lin, H., and Yu, Y. (2011). Single-index coefficient models for nonlinear time
series. Journal of Nonparametric Statistics, 23(1), 37–58.
DOI: 10.1080/10485252.2010.497554. [384]
Xia, X. and An, H.Z. (1999). Projection pursuit autoregression in time series. Journal of
Time Series Analysis, 20(6), 693–714. DOI: 10.1111/1467-9892.00167. [381, 383]
Xia, Y. and Li, W.K. (1999). On single-index coefficient regression models. Journal of the
American Statistical Association, 94(448), 1275–1285. DOI: 10.2307/2669941. [378]
Xia, Y., Tong, H., and Li, W.K. (1999). On extended partially linear single-index models.
Biometrika, 86(4), 831–842. DOI: 10.1093/biomet/86.4.831. [378, 379]
Yakowitz, S.J. (1985). Nonparametric density estimation, prediction, and regression for
Markov sequences. Journal of the American Statistical Association, 80(389), 215–221.
DOI: 10.1080/01621459.1985.10477164. [382]
Yakowitz, S.J. (1987). Nearest neighbor methods for time series analysis. Journal of Time
Series Analysis, 8(2), 235–247. DOI: 10.1111/j.1467-9892.1987.tb00435.x. [353, 382]
Yang, Y. (2012). Modelling Nonlinear Vector Economic Time Series, Ph.D. thesis, Aarhus
University, Denmark. CREATES Research Paper 2012-7. Available at: http://pure.au.
dk/portal/files/45638557/Yukai_Yang_PhD_Thesis.pdf. [487, 489]
Yang, K. and Shahabi, C. (2007). An efficient k nearest neighbor search for multivariate
time series. Information and Computation, 205(1), 65–98.
DOI: 10.1016/j.ic.2006.08.004. [522]
Yang, L., Härdle, W., and Nielson, J. (1999). Nonparametric autoregression with multiplic-
ative volatility and additive mean. Journal of Time Series Analysis, 20(5), 579–604.
DOI: 10.1111/1467-9892.00159. [382]
Yang, Z., Tian, Z., and Yuan, Z. (2007). GSA-based maximum likelihood estimation for
threshold vector error correction model. Computational Statistics & Data Analysis, 52(1),
109–120. DOI: 10.1016/j.csda.2007.06.003. [486]
Yi, J. and Deng, J. (1994). The ergodicity of vector self excited threshold autoregressive
(VSETAR) models. Applied Mathematics. A Journal of Chinese Universities, Series A
(Chinese Edition), 9(1), 53–59. [486]
Young, P.C. (1993). Time variable and state dependent modelling of non-stationary and
nonlinear time series. In T. Subba Rao (Ed.), Developments in Time Series Analysis.
Chapman & Hall, London, pp. 374–413. [384]
Young, P.C. and Beven, K.J. (1994). Data-based mechanistic modelling and the rainfall-flow
non-linearity. Environmetrics, 5(3), 335–363. DOI: 10.1002/env.3170050311. [384]
Yu, P.L.H., Li, W.K., and Jin, S. (2010). On some models for Value-at-Risk. Econometric
Reviews, 29(5-6), 622–641. DOI: 10.1080/07474938.2010.481972. [81]
Yuan, J. (2000a). Testing linearity for stationary time series using the sample interquartile
range. Journal of Time Series Analysis, 21(6), 713–722.
DOI: 10.1111/1467-9892.00206. [150]
Yuan, J. (2000b). Testing Gaussianity and linearity for random fields in the frequency do-
main. Journal of Time Series Analysis, 21(6), 723–737.
DOI: 10.1111/1467-9892.00207. [150]
Zeevi, A.J., Meir, R., and Adler, R.J. (1999). Non-linear models for time series using mix-
tures of autoregressive models. Available at: http://citeseerx.ist.psu.edu/viewdoc/
summary?doi=10.1.1.44.4549. [296]
Zhang, J. and Stine, R.A. (2001). Autocovariance structure of Markov regime switching
models and model selection. Journal of Time Series Analysis, 22(1), 107–124.
DOI: 10.1111/1467-9892.00214. [75]
Zhang, X., King, M.L., and Hyndman, R.J. (2006). A Bayesian approach to bandwidth selec-
tion for multivariate kernel density estimation. Computational Statistics & Data Analysis,
50(11), 3009–3031. DOI: 10.1016/j.csda.2005.06.019. [305]
Zhang, X., Wong, H., Li, Y., and Ip, W.-C. (2011). A class of threshold autoregressive
conditional heteroscedastic models. Statistics and Its Interface, 4(2), 149–157.
DOI: 10.4310/sii.2011.v4.n2.a10. [248]
Zhou, Z. and Wu, W.B. (2009). Local linear quantile estimation for nonstationary time
series. The Annals of Statistics, 37(5B), 2696–2729. DOI: 10.1214/08-aos636. [382]
Zhu, K., Yu, P.L.H., and Li, W.K. (2014). Testing for the buffered autoregressive processes.
Statistica Sinica, 24(2), 971–984. DOI: 10.5705/ss.2012.311. [81]
Zivot, E. and Wang, J. (2006). Modeling Financial Time Series with S-Plus (2nd edn.).
Springer-Verlag, New York. DOI: 10.1007/978-0-387-32348-0. Freely available at: http:
//faculty.washington.edu/ezivot/econ589/manual.pdf. [75]
Zoubir, A.M. (1999). Model selection: A bootstrap approach. In IEEE International Con-
ference on Acoustics, Speech, and Signal Processing, Vol. 3, IEEE, Phoenix, AZ, USA,
pp. 1377–1380. DOI: 10.1109/icassp.1999.756237. [140]
Zoubir, A.M. and Iskander, D.R. (1999). Bootstrapping bispectra: An application to testing
for departure from Gaussianity of stationary signals. IEEE Transactions on Signal
Processing, 47(3), 880–884. DOI: 10.1109/78.747796. [150]
Books about Nonlinear Time Series Analysis

General
Chan (2009)
Douc et al. (2014)
Franses and Van Dijk (2000)
Granger and Teräsvirta (1992a)
Guégan (1994)
Priestley (1988)
Teräsvirta et al. (2010)
Tong (1990)
Wiener (1958)

Chaos
Cutler and Kaplan (1996)
Chan and Tong (2001)
Diks (1999)
Kantz and Schreiber (2004)
Vialar (2005)

Proceedings
Barnett et al. (2006)
Casdagli and Eubank (1992)
Dagum et al. (2004)
Fitzgerald et al. (2000)
Franke et al. (1984)
Hsiao et al. (2011)

Applications
Casdagli and Eubank (1992)
Donner and Barbosa (2008)
Dunis and Zhou (1998)
Galka (2000)
Haldrup et al. (2014)
Ma and Wohar (2014)
Milas et al. (2006)
Patterson and Ashley (2000)
Rothman (1999)
Schleer–van Gellecom (2014)
Small (2005)

Semi- and nonparametric
Fan and Yao (2003)
Gao (2007)
Li and Racine (2007)

Spectral and signal analysis
Haykin (1979)
Mohler (1987)

Threshold and RCA models
Tong (1983)
Nicholls and Quinn (1982)

Bilinear models
Granger and Andersen (1978a)
Subba Rao and Gabr (1984)
Terdik (1999)
Notation and Abbreviations

The following notation is frequently used throughout the book. The number following the
description of a notation marks the page where the notation is first introduced.
General
≡ equals, by definition 10
⊥ perpendicular, mutually singular (of measures) 457
‖x‖ norm of x in L2 (Euclidean norm) 19
‖x‖p Lp-norm 112
! factorial 299
!! semifactorial: (2k − 1)!! = 1 · 3 · 5 · · · (2k − 1) 221
[x] integer part of scalar x (largest integer ≤ x) 127
⌊x⌋ the largest integer not greater than x 126
x∧y = min(x, y) 449
x∨y = max(x, y) 198
log(x) natural logarithm of x (with base e = 2.71828 · · · ) 11
log+(x) = max{log(x), 0} 89
B backward shift (or lag) operator 62
C = 0.5772156649 · · · , Euler’s constant 89
δij Kronecker delta, where δij = 1 if i = j and δij = 0 if i ≠ j 327
∃ “there exists” 37
h ≡ hT smoothing parameter or bandwidth 209
hb binwidth 270
K(·), Kh (·) kernel function (with bandwidth h) 260
∀ “for all” (“for every”) 2
arg min argument that minimizes a function 58
arg max argument that maximizes a function 340
exp exponential 2
inf infimum (greatest lower bound) 339
min minimum 44
max maximum 37
Leb Lebesgue measure on Rm 98
lim limit (number); also limit (sets) 13
lim inf inferior limit (number); also inferior limit (sets) 91
lim sup superior limit (number); also superior limit (sets) 91
Ran H range of the function H 306
sign(a) sign of the real number a 311
sup supremum (least upper bound) 19
s.t. “subject to” 93
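
To make a few of the scalar conventions above concrete, here is a minimal Python sketch (illustrative only; the code and its function names are not from the book) of the semifactorial, log+(x), the floor ⌊x⌋, the x ∧ y and x ∨ y shorthands, and a kernel Kh(·) with bandwidth h, here taken to be Gaussian:

    import numpy as np

    def semifactorial(k):
        # (2k - 1)!! = 1 * 3 * 5 * ... * (2k - 1)
        return int(np.prod(np.arange(1, 2 * k, 2)))

    def log_plus(x):
        # log+(x) = max{log(x), 0}
        return max(np.log(x), 0.0)

    def gaussian_kernel_h(u, h):
        # K_h(u) = K(u/h)/h, here with a standard Gaussian kernel K
        return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

    x, y = 2.7, 1.3
    print(semifactorial(3))             # 1 * 3 * 5 = 15
    print(log_plus(0.5))                # log(0.5) < 0, so the result is 0.0
    print(np.floor(x))                  # 2.0, the largest integer not greater than x
    print(min(x, y), max(x, y))         # x ∧ y = 1.3 and x ∨ y = 2.7
    print(gaussian_kernel_h(0.0, 0.5))  # kernel weight at the origin, approx. 1.596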
Sets
{·} set designation; also sequence, array 2
∈, ∉ set membership, does not belong to 2
∪ union 41
⊂ subset (strict containment) 198
∩ intersection 41
Ft σ-algebra (information set) 2
∅ empty (null) set 41
I(·) indicator function, i.e. I(z) = 1 if z > 0 and I(z) = 0 if z ≤ 0 16
ℑ(·) imaginary part 121
N = {0, 1, 2, . . .}, i.e. the set of all natural numbers, including zero 10
R the set of all real numbers 41
R+ the set of all non-negative real numbers 240
Rn , Rm×n the set of real n × 1 vectors (m × n matrices) 37
ℜ(·) real part 142
Z = {0, ±1, ±2, . . .}, i.e. the set of all integers 2
Z+ = {1, 2, 3, . . .}, i.e. the set of all positive integers 19
Special matrices and vectors
e = (1, 0, . . . , 0)′, a vector with 1 in the first entry and zeros
elsewhere 205
1 = (1, . . . , 1)′, a vector of ones 491
In identity matrix of order n × n 42
Om×n m × n null matrix 42
0m , 0m×1 m × 1 null vector 42
Operations on matrix A and vector a
A′, a′ transpose of a matrix or vector 13
A−1 inverse of a matrix 129
A# Hankel matrix 219
diag(A) diagonal matrix, containing the diagonal elements of A 262
vec(A) stacking the columns of A one underneath the other into one vector 441
vech(A) stacking the elements of A on and below the main
diagonal into one vector 180
ρ(A) maximum absolute eigenvalue of A (spectral radius) 90
tr(A) trace 229
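
The matrix operations above have direct numpy counterparts. The following minimal sketch (illustrative only, not code from the book) shows vec(A), vech(A), the spectral radius ρ(A), and tr(A) for a small example matrix:

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [2.0, 3.0]])

    # vec(A): stack the columns of A one underneath the other
    vec_A = A.flatten(order="F")
    # vech(A): stack, column by column, the elements on and below the main diagonal
    vech_A = np.concatenate([A[j:, j] for j in range(A.shape[1])])
    # rho(A): maximum absolute eigenvalue (spectral radius)
    rho_A = np.max(np.abs(np.linalg.eigvals(A)))
    # tr(A): sum of the diagonal elements
    tr_A = np.trace(A)

    print(vec_A)   # [4. 2. 1. 3.]
    print(vech_A)  # [4. 2. 3.]
    print(rho_A)   # 5.0 (the eigenvalues of A are 5 and 2)
    print(tr_A)    # 7.0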
Table 2: List of abbreviations. The number following the description marks the page where
the notation is first introduced. For acronyms given to threshold-type time series models,
we refer to Appendix 2.B.

List of Algorithms
CHAPTER 4
4.1 The Subba Rao–Gabr Gaussianity test (126)
4.2 The Subba Rao–Gabr linearity test (129)
4.3 Goodness-of-fit test statistics (135)
4.4 Bootstrap-based tests (138)
4.5 The MSFE-based linearity test statistic (144)

CHAPTER 5
5.1 LM_T^(3*) test statistic (161)
5.2 LM_T^(3**) test statistic (162)
5.3 F_T^(5) test statistic (164)
5.4 LM_T^(7) test statistic (168)
5.5 Bootstrapping p-values of F_T^(1,i) test statistic (172)
5.6 Bootstrapping p-values of LR_T^(9) test statistic (176)
5.7 Tukey's nonadditivity-type test statistic (180)
5.8 F_T^(O) test statistic (181)
5.9 CUSUM test statistic (183)
5.10 TAR F test statistic (184)
5.11 New F test statistic (185)

CHAPTER 6
6.1 Nonlinear iterative optimization (200)
6.2 A multi-parameter grid search (204)
6.3 The density function of M− (206)
6.4 Sampling Y1 from an estimate of F1(·|r0) (207)
6.5 k-regime subset SETARMA–CLS estimation (211)
6.6 A simple genetic algorithm (211)
6.7 CLS estimation of the BL model (218)
6.8 Minimum order selection (233)
6.9 Leave-one-out CV order selection (234)
6.10 Selecting a (SS)TARSO model (244)

CHAPTER 7
7.1 Bootstrapped p-values for single-lag tests (276)
7.2 Permutation-based p-values for multiple-lag tests (277)
7.3 Bootstrapping p-values of the BDS test statistic (281)
7.4 Bootstrap-based p-values for multivariate serial independence tests (289)

CHAPTER 8
8.1 The Ramsey–Rothman TR test (319)
8.2 The bispectrum-based TR test (322)
8.3 The trispectrum-based TR test (324)
8.4 Resampling scheme (326)

CHAPTER 9
9.1 Loess/Lowess (353)
9.2 Robust Loess/Lowess (354)
9.3 Resampling scheme for MFDs (357)
9.4 ACE (361)
9.5 Gradient descent boost (371)
9.6 Bootstrap-based LR-type test (376)
9.7 Estimating θ and hT for the single-index model (379)

CHAPTER 10
10.1 Bootstrap FI (410)
10.2 Bootstrap bias-corrected FI (411)

CHAPTER 11
11.1 A nonadditivity-type test for nonlinearity (459)
11.2 Tukey's nonadditivity-type test for nonlinearity (460)
11.3 F_T^(O) test statistic for nonlinearity (461)
11.4 Multivariate test statistic for VSETAR (464)
11.5 LM_{T,p}^(1)(m) test statistic for LVSTAR (469)
11.6 Bootstrapping the GIRF (489)

CHAPTER 12
12.1 Bootstrap-based p-values for LRT (509)
List of Examples
CHAPTER 1
1.1 U.S. Unemployment Rate (4)
1.2 EEG Recordings (5)
1.3 Magnetic Field Data (6)
1.4 ENSO Phenomenon (7)
1.5 Climate Change (8)
1.6 Summary Statistics (11)
1.7 Summary Statistics (Cont'd) (14)
1.8 Sample ACF and Kendall's τ (17)
1.9 The Logistic Map (20)
1.10 EEG Recordings (Cont'd) (22)

CHAPTER 2
2.1 A BL Time Series (33)
2.2 Comparing BL Time Series (35)
2.3 Dynamic Effects of a BL Model (36)
2.4 ExpAR Time Series (38)
2.5 Dynamic Effects of an NLMA Model (40)
2.6 Dynamic Effects of a SETAR Model (42)
2.7 A Simulated CSETAR Process (45)
2.8 A Simulated SETAR(2; 1, 1)2 Model (46)
2.9 Dynamic Effects of an asMA Model (48)
2.10 NEAR(1) Model (53)
2.11 Skeleton of an AR–NN(2; 0, 1) Model (59)
2.12 Skeleton of an AR–NN(3; 1, 1, 1) Model (60)
2.13 A Simulated L2 GNN(2; 1, 1) Time Series (63)
2.14 A Two-regime Simulated MS–AR(1) Time Series (67)
A.1 Impulse Response Analysis (78)

CHAPTER 3
3.1 Evaluating the Top Lyapunov Exponent (89)
3.2 An Explicit Expression for γ (92)
3.3 Numerical Evaluation of γ (93)
3.4 Geometric Ergodicity of the SRE (97)
3.5 SETAR Geometric Ergodicity (99)
3.6 Invertibility of an RCMA(1) Model (104)
3.7 Invertibility of an ASTMA(1) Model (105)
3.8 Invertibility of a SETMA Model (108)

CHAPTER 4
4.1 Third-order Cumulant and Bispectrum (124)
4.2 Principal Domain of the Subba Rao–Gabr Gaussianity Test (127)

CHAPTER 5
5.1 ENSO Phenomenon (Cont'd) (173)
5.2 U.S. Unemployment Rate (Cont'd) (177)
5.3 Interpretation of the LM_T^* Test Statistic (186)

CHAPTER 6
6.1 NLS Estimation (201)
6.2 U.S. Unemployment Rate (Cont'd) (208)
6.3 U.S. Real GNP (212)
6.4 ENSO Phenomenon (Cont'd) (215)
6.5 CLS-based Estimation of a BL Model (221)
6.6 Daily Hong Kong Hang Seng Index (225)
6.7 U.S. Unemployment Rate (Cont'd) (235)
6.8 Daily Hong Kong Hang Seng Index (Cont'd) (239)

CHAPTER 7
7.1 Some Kernel Functions and their FTs (261)
7.2 An Explicit Expression for ΔQ(·)
7.3 Magnetic Field Data (Cont'd) (273)
7.4 U.S. Unemployment Rate (Cont'd) (276)
7.5 Dimension of an ExpAR(1) Process (280)
7.6 S&P 500 Daily Stock Price Index (283)
A.1 NW Kernel Regression Estimation (303)
B.1 Gaussian and Student t copulas (307)

CHAPTER 8
8.1 Exploring a Logistic Map for TR (317)
8.2 Exploring a Simulated SETAR Process for TR (321)
8.3 Exploring a Time-delayed Hénon Map for TR (329)

CHAPTER 9
9.1 A Comparison Between Conditional Quantiles (345)
9.2 Old Faithful Geyser (347)
9.3 Hourly River Flow Data (354)
9.4 Canadian Lynx Data (Cont'd) (359)
9.5 Sea Surface Temperatures (362)
9.6 Sea Surface Temperatures (Cont'd) (364)
9.7 Sea Surface Temperatures (Cont'd) (368)
9.8 Quarterly U.S. Unemployment Rate (Cont'd) (372)
9.9 Quarterly U.S. Unemployment Rate (Cont'd) (376)
9.10 A Monte Carlo Simulation Experiment (379)

CHAPTER 10
10.1 Forecast Density (393)
10.2 Comparing LS and PI Forecast Strategies (396)
10.3 Comparing NFE and MC Forecasts (403)
10.4 Forecasts from an ExpAR(1) Model (405)
10.5 Forecasts from a SETAR(2; 1, 1) Model (407)
10.6 FIs for a Simulated SETAR Process (412)
10.7 Hourly River Flow Data (Cont'd) (414)
10.8 ENSO Phenomenon (Cont'd) (422)

CHAPTER 11
11.1 Stationarity and Invertibility of a Bivariate BL Model (445)
11.2 A Two-regime Bivariate VSETAR(2; 1, 1) Model (450)
11.3 An LVSTAR Model with Nonlinear Cointegration (456)
11.4 An LVSTAR Model with a single CNF (458)
11.5 Tree Ring Widths (463)
11.6 Tree Ring Widths (Cont'd) (470)
11.7 Forecasting an LVSTAR(1) Model with CNFs (477)

CHAPTER 12
12.1 A Monte Carlo Experiment (497)
12.2 Daily Returns of Exchange Rates (499)
12.3 Sea Surface Temperatures (Cont'd) (504)
12.4 Sea Surface Temperatures (Cont'd) (509)
12.5 Climate Change (Cont'd) (513)
12.6 Climate Change (Cont'd) (518)
Table 3: Time series used throughout the book. File names are given in parentheses.
(1) First differences of original data.
(2) Logistic transformation of original data.
Subject index