Computer Class 2_time series
Time-series data are any data in which time (seconds, minutes, hours, days, weeks,
months, quarters, years) is a significant component of the analysis and the data are
arranged by this time dimension. To use Stata's time-series functions, you must first
make sure that your data are indeed time-series, i.e. you must have a variable that
identifies the time dimension (this variable may or may not be in an easily
recognisable date format). Estimating a time series regression model involves the
same procedures as in standard regression models covered in the previous computer
class. However, before interpreting your results, you must check a few issues that are
specific to regression models using time series data, such as serial correlation, non-
stationarity and the possibility of spurious regression. A non-stationary variable has a
mean, variance and covariance that are not constant over time. If the variables in your
model are non-stationary and you apply OLS to the model, you might end up with the
problem of spurious or nonsense regression. This may not be the case, though, if the
variables are cointegrated. Variables are said to be cointegrated when a long-run
relationship exists among them. We illustrate some of these concepts in this workshop
with a real life data set.
Open Stata. The data set we are going to use for this workshop is called
consumption.dta and is available in Moodle. Save the data onto a drive of your
choice and open it in Stata (click File>Open>consumption.dta). This dataset
contains quarterly data on consumption and income for Australia over the period
1985:1 to 2002:2 (i.e. for a total of 82 observations).
Before performing any analysis on the data, it is important that Stata recognises which
variable represents the time dimension. The variable called date is not currently in a
format easily recognisable by Stata. (It is possible to convert the date variable using
Stata’s date function, but we will not delve into this here). To identify the time
dimension of the series, we generate a new variable called time which takes the value
1, 2, 3, etc. for 1985:1, 1985:2, and so on:
generate time = _n
Note that _n represents the actual observation number. The above will be correct only
if your data are sorted from the earliest to the latest date. You need to check that this is
the case. With some data sets, you need to sort the data on the variable of interest,
otherwise the time variable will not be assigned the correct numbers when defined (i.e.
your observations need to be in the correct chronological order).
TIP: _n gives the actual line of observation, while _N gives the total number of observations in
the dataset. Try the following:
generate test = _N
edit test
Now type
count
What do you notice? You can now delete this newly created variable test (we will not use it
in the analysis):
drop test
Tip: When you see —more—, pressing 'Enter' advances the screen line by line, while pressing
the space bar advances it page by page. If you do not want the —more— prompt at all, type

set more off
We now instruct Stata that the time dimension is identified by the newly created
variable time. It is important that you do this otherwise Stata will not be able to use
the automatic time series commands such as those used for performing a Dickey-
Fuller test. To do this, use the tsset command:

tsset time

Note that sometimes the variable representing time in your dataset may not be in
numeric format and hence
will not be recognised by Stata. Numeric variables, as the name implies, have
observations that are in number format. Non-numeric variables have observations that
are in string format (e.g. in text). String variables can be converted to numeric using
the destring command and numeric variables can be converted into string variables
using the tostring command (as usual, typing help commandname will launch a
separate window explaining how to use that specific command, along with the various
options available for it).
As we are dealing with time series data, we need to be aware of the potential presence
of unit roots in the variables. Hence, we need to start doing a few checks to see if that
might be the case. Visualising the evolution of the series over time through a graph
would be a good start towards detecting any pattern that might indicate non-
stationarity. To create a line graph of both series over time, we can use Stata's tsline command.
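Assuming the series are named consumption and income (as in the dataset description), a minimal version is:

tsline consumption income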
The time plot reveals that both consumption and income are trending upwards, a
pattern typical of non-stationary time series, and that the two series seem to move
together. Both series appear to fluctuate around what looks like a linear time trend,
which suggests that each is non-stationary.
Note that plotting two or more series on the same graph is not advisable if they are
measured on significantly different scales, as the series with the largest scale will
dominate the y-axis and may wrongly give the impression that no pattern can be
detected in the other series (a solution is to plot the series against two separate
y-axes on the same graph).
Suppose we believe that a long run relationship exists between consumption and
income and we wish to estimate that relationship:
$consumption_t = \beta_0 + \beta_1 \, income_t + u_t$
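To estimate this relationship by OLS in Stata (using the variable names from the dataset):

regress consumption income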
You can immediately notice an R-squared value nearly equal to 1 and a very large t-value
on the coefficient of income. A very high R-squared combined with very large t-statistics
tends to be symptomatic of serial correlation, and possibly of spurious regression
(although we will later see that the latter is not the case for these two variables). To
check for serial correlation of a simple AR(1) nature, we could use the Durbin-Watson test.
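In Stata, this is available as the estat dwatson post-estimation command, run immediately after the regression above:

estat dwatson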
Check the computed Durbin-Watson statistic against the table values: with one
explanatory variable (one degree of freedom) and 82 observations, the table value of dL
at the 1% significance level lies between 1.446 and 1.482. Since a statistic below dL leads
us to reject the null of no positive autocorrelation, it is clear that (positive) serial
correlation of at least an AR(1) nature exists in the error terms of the above regression model.
Assume for now that these error terms are serially correlated following an AR(1)
process. How would we correct for this? Since the standard errors are problematic
under OLS in the presence of serial correlation, we can use Newey-West corrected
standard errors (their formula looks complicated, but Stata computes them
automatically). In Stata, Newey-West standard errors are requested with the newey command.
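Its lag() option sets the maximum lag used in the correction; with our variable names:

newey consumption income, lag(1)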
Note that we have used one lag because we assumed an AR(1) serial correlation process.
As you can observe from the output, the Newey-West corrected standard errors are much
larger than the OLS standard errors. Other ways of correcting for serial correlation are to
use feasible GLS, including the Cochrane-Orcutt procedure and the Prais-Winsten
transformation, which avoids the loss of one observation in the regression (see do file).
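For reference, a plausible pair of commands for these (Stata's prais command estimates Prais-Winsten by default, and its corc option switches to Cochrane-Orcutt):

prais consumption income, corc
prais consumption income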
Serial correlation could also be of a higher order (AR(2), AR(3), and so on), and we need
to check whether that might be the case here. Since the Durbin-Watson statistic is not
suitable for testing serial correlation of order higher than AR(1), you need to use the
Breusch-Godfrey LM test. Recall that you can also use the Breusch-Godfrey test for
suspected AR(1) serial correlation, and that you must use this test if the original time
series model contains lagged value(s) of the dependent variable on the right-hand side.
In the following, we have used 4 lags, but you can try different lag lengths as well (in
fact, you will notice that serial correlation persists even at very high lag orders). In
practice, the optimal number of lags can be determined using an information criterion.
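Like the Durbin-Watson test, the Breusch-Godfrey test is run as a post-estimation command after the OLS regression; the four-lag version referred to above is:

estat bgodfrey, lags(1/4)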
The Breusch-Godfrey test also confirms the existence of serial correlation of at least an
AR(4) nature (or even higher if you use higher lag numbers) in the model. We can gather
this from the last column of p-values, which are all well below conventional significance levels.
TIP: in the above command, we typed the option "lags(1/4)", meaning run a serial correlation
test assuming an AR(1) process, an AR(2) process, an AR(3) process and an AR(4) process. We
could equally have typed "lags(1 2 3 4)" and obtained the same answer.
Since our data are in time series format, we need to check whether the regression
model is spurious or not, particularly if the series are non-stationary. Hence, we need
to check if the series are stationary or not. One popular test for non-stationarity is the
Dickey-Fuller (DF) test, which is based on a random walk model in three variants.
Transforming each series into first differences, we run one of the following OLS
regressions on each series, corresponding to the three variants.
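Writing $Y_t$ for the series under test, the three variants are:

Model 1 (no drift, no trend): $\Delta Y_t = \delta Y_{t-1} + u_t$ (1)
Model 2 (drift, no trend): $\Delta Y_t = \beta_0 + \delta Y_{t-1} + u_t$ (2)
Model 3 (drift and time trend): $\Delta Y_t = \beta_0 + \beta_1 t + \delta Y_{t-1} + u_t$ (3)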
The unit root test is essentially about testing the following hypotheses on the
coefficient $\delta$:

H0: $\delta = 0$ (the series is non-stationary, i.e. it contains a unit root)
H1: $\delta < 0$ (the series is stationary)
Note carefully that for this test, the null hypothesis is non-stationarity, i.e. there is a
unit root in the series. We test for a unit root by running OLS on one of the above
models. How do we decide which of models (1), (2) and (3) is more appropriate for
the unit root test? A plot of the series can give you a clue.
As a general principle:
- If the series appears to fluctuate around an average value of zero, use model 1.
- If the series appears to fluctuate around a non-zero average value, with no trend, use model 2.
- If the series appears to fluctuate around a (linear) time trend, use model 3.
Reading from our previous graphs, it appears that our series are each fluctuating
around a linear time trend, and a unit root test on these would call for model 3. If we
perform the unit root test from scratch in its actual regression form:
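For consumption, for example, model 3 can be estimated directly by OLS, using the time variable we created earlier as the trend term:

regress d.consumption l.consumption time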
Note that d.consumption is an operator in Stata which instructs it to use the first
difference of consumption (i.e. $consumption_t - consumption_{t-1}$). Similarly,
l.consumption instructs Stata to use the lagged value of consumption (i.e. $consumption_{t-1}$).
Equivalently, the automatic Stata command for the DF test is dfuller, which by
default assumes a random walk model with drift and no trend, i.e. model 2. Different
options can be added to the dfuller command to cover models 1 and 3. For instance,
below we add the option trend to ask Stata to run model 3.
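Adding the regress option as well (its role is explained next), the command for consumption is:

dfuller consumption, trend regress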
The regress option in the dfuller command instructs Stata to report the actual OLS
regression results for model 3. Note that these results match those obtained when we
ran the unit root test in its actual regression form. We are interested in the test statistic
value of 1.550. Remember that this test statistic (the same as the t-statistic on the lagged
level term, whose coefficient is $\delta$, in the lower part of the output) does not follow
the standard t-distribution, but rather a tau-distribution. This value therefore cannot be
compared against standard t-distribution tables, but should instead be compared against
the critical values given in Dickey-Fuller tables.
The tau statistic of 1.55 is smaller in absolute value than the 10% critical value of -3.13.
Note that Stata also simulates and reports critical values close to those in the
Dickey-Fuller tables; at this stage, it is advisable to rely on the Dickey-Fuller table.
From our test, we do not reject H0 and conclude that consumption is non-stationary.
Because the error term in the Dickey-Fuller unit root test regression may suffer from
serial correlation, we can use the Augmented Dickey-Fuller (ADF) test to account for
this. The ADF test, as the name suggests, augments the DF regression with lagged
values of $\Delta Y_t$ to correct for serial correlation. When specifying the ADF test,
one can use any of the three models of the DF test. Based on the model 3 version
(drift and time trend), we augment the DF regression with p lagged values of $\Delta Y_t$.
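Writing $\alpha_i$ for the coefficients on the lagged differences, the augmented regression is:

$\Delta Y_t = \beta_0 + \beta_1 t + \delta Y_{t-1} + \sum_{i=1}^{p} \alpha_i \Delta Y_{t-i} + u_t$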
The Stata command to implement the ADF test is the same dfuller command, but we
need to add the lags() option, where the number in brackets is the number of lagged
$\Delta Y_t$ terms in the equation. Below, we use one lag of $\Delta Y_t$, and the
command is:
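dfuller consumption, trend lags(1)

(We keep the trend option to stay with model 3; repeat the command with income in place of consumption for the second series.)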
If you want to add two lagged values of $\Delta Y_t$, you would specify lags(2) in the
option, and so on. Note that we still do not reject H0, confirming that consumption is
non-stationary. The optimal number of lagged values of $\Delta Y_t$ to use in the ADF
test can be determined using information criteria such as the Akaike Information
Criterion (AIC) or the Schwarz Information Criterion (SIC). Unfortunately, the dfuller
command does not compute these statistics, and hence they have to be computed manually.
An alternative test, the Dickey-Fuller GLS test, performs a modified Dickey-Fuller test
for a unit root in which the series has been transformed by a generalised least-squares
regression, as opposed to the ordinary least squares (OLS) method used in the standard
DF test. This test is known to be more robust than the DF test; it always includes the
trend and intercept, automatically computes the AIC and SIC statistics for the ADF test,
and computes the test for different numbers of lagged values of $\Delta Y_t$ in a single
command (with the DF test we have to do this one lag length at a time). If you are
interested in pursuing this test in your own time, the Stata command to implement it is
dfgls variablename, maxlag(p), where maxlag is the maximum number of lagged values
of $\Delta Y_t$ you want to include in your ADF test. Below, we simply use one lag for
both consumption and income. In practice, we would use an information criterion to
determine the number of lags.
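With one lag, and assuming the same variable names:

dfgls consumption, maxlag(1)
dfgls income, maxlag(1)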
Comparing the unit root test results, both series are confirmed as non-stationary in
levels, as the tau statistics are smaller in absolute value than the Dickey-Fuller critical
values at the 5% level.
Cointegration
Given the two series are non-stationary, there is the possibility that the OLS
regression model we ran on these series at the very beginning might be spurious, in
which case we cannot use the coefficients as reliable estimates of the relationship
between consumption and income. The only exception to this is if the series share a
long-run relationship or, in time series econometrics jargon, “a cointegrating
relationship”. To check if the two series are cointegrated, we need to carry out a
cointegration test, which involves first running an OLS regression of consumption on
income and then carrying out a unit root test on the estimated residuals. A Dickey-
Fuller test (with no drift and no time trend) can be carried out on the predicted
residuals. The first stage is to run the regression of consumption on income:
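regress consumption income

(This is the same regression we ran at the start of the workshop.)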
Then, predict the residuals and run a unit root test on these residuals:
$\Delta \hat{u}_t = \delta \hat{u}_{t-1} + \varepsilon_t$
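In Stata, calling the predicted residuals uhat (a name chosen here for illustration), and using the noconstant option to match the no-drift, no-trend specification:

predict uhat, residuals
dfuller uhat, noconstant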
Note that this time the test statistic of -3.909 has to be compared with the critical
values from the Engle-Granger cointegration test table (in our case, the second line,
corresponding to a model with a constant).
Our unit root test shows that the error term is stationary; hence consumption and
income are cointegrated. This suggests that our initial OLS regression on these
variables was not spurious, and we can use its results to understand the long-run
relationship between consumption and income. Note that the relevant line to consult
in the table depends on how your time series model between X and Y is formulated,
corresponding to No Constant, Constant, or Constant and (time) Trend.
If our two series were non-stationary but not cointegrated, in order to avoid spurious
regression, we would have had to run OLS on the differenced series. Whether we use
the first difference, second difference or higher depends on how many times we have
to difference the series to make them stationary. In practice, most non-stationary
series will become stationary after first differencing – they are said to be I(1) in this
case. In this example, our series are I(1). You can check this by running a Dickey-
Fuller test on the first difference of the series:
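The difference operator can be used directly inside dfuller; here with the default drift (no trend) specification:

dfuller d.consumption
dfuller d.income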
The first differences of consumption and income are both stationary. Hence, both our
series are said to be integrated of order 1, or I(1); that is, we only had to difference
them once for them to become stationary. If we had to difference them twice to make
them stationary, we would say they are I(2), and so on. Most economic time series
tend to be I(1).
Our series are non-stationary, but suppose our Engle-Granger test had revealed that
they were not cointegrated (i.e. the unit root test on the predicted residuals did not
reject the null of non-stationarity). In that case, we would need to run the OLS
regression on the first differences of these variables in order to avoid the risk of the
regression results being spurious (recall that OLS on stationary series is fine):
$\Delta consumption_t = \beta_0 + \beta_1 \, \Delta income_t + u_t$
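In Stata, this differenced regression could be estimated as:

regress d.consumption d.income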