Time Series Project
08/12/2019
Overview of DATASET-1
The first time series analyzed in this project corresponds to monthly sunshine data at
Heathrow Airport, London, collected from an automatic Kipp & Zonen sensor or otherwise
taken from a Campbell Stokes recorder. The first measurement is from January 1948 and the
final measurement from September 2019, for a total of 861 measurements (some of which
are missing).
To correctly read the data into R we inspect the text file that contains the data,
heathrowdata.txt, in Notepad++ and observe the following:
• the file contains 7 variables (as can be seen in the R output below):
• YYYY - representing the year
• MM - representing the month
• Tmax - representing the daily maximum temperature in degrees Celsius
• Tmin - representing the daily minimum temperature in degrees Celsius
• AF - representing the number of days of air frost
• Rain - representing the rainfall in mm
• Sun Hours - representing the sunshine hours
• the time series data begins on line 14 (the first 13 rows contain general information
about the data, e.g. exact variable meanings);
• the data fields are aligned in fixed-width columns; hence no separator is required;
• the decimal separator is a period;
• missing values are denoted with a dash;
• the number of lines that contain time series data is 753;
• the time series data is presented in a single column from the oldest to the newest
observation.
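Based on these observations, the import step can be sketched as follows; the column names
and the exact dash token passed to na.strings are assumptions, only the 13-line header and
the fixed-width layout are stated above:

```r
# Hedged sketch of the import implied by the observations above;
# skip = 13 drops the descriptive header, and the dash used for
# missing values is mapped to NA (exact token is an assumption)
data = read.table("heathrowdata.txt", skip = 13, header = FALSE,
                  col.names = c("YYYY", "MM", "TMAX", "TMIN", "AF", "RAIN", "SUN"),
                  na.strings = c("-", "---"), fill = TRUE)
```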
library(forecast)
We omit the rows with missing sun-hours values to obtain a time series without missing
values.
data1 = na.omit(data)
In the model fitting process we leave out the last 12 observations. They will be used
later to assess how good the forecasts of our estimated models are, by comparing the
forecasts with the actual known values. In other words, we have monthly data from January
1957 to September 2018 for modelling.
series1=data1[1:(nrow(data1)-12),]
actual1=tail(data1$SUN,n=12)
actual1
## [1] 137.0 72.9 40.3 56.4 120.2 119.0 170.1 176.3 170.1 194.5 201.2
## [12] 156.8
According to the Phillips-Perron unit root test, we can stick to the original series: the
corresponding p-value 0.01 is less than 0.05, so the null hypothesis of a unit root is
rejected and no non-seasonal differencing is needed.
##
## Phillips-Perron Unit Root Test
##
## data: Z1
## Dickey-Fuller = -11.471, Truncation lag parameter = 6, p-value =
## 0.01
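The output above matches the format printed by stats::PP.test, so the test was presumably
run as follows (the object name Z1 is taken from the output itself):

```r
# Phillips-Perron unit root test on the series Z1; PP.test prints
# the Dickey-Fuller statistic and truncation lag shown above
PP.test(Z1)
```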
Based on the autocorrelation plots in the figure above, we first consider an MA(1) or MA(2)
model for the seasonal part and an AR(1) model for the low-order part. Our initial candidate
model is ARIMA(1, 0, 0) × (0, 1, 1)12.
m1=Arima(Z1,order=c(1,0,0),seasonal=list(order=c(0,1,1),period=12))
tsdiag(m1,gof.lag=60)
The corresponding diagnostic plots are presented in the following figure. From them we see
that a few autocorrelations are close to the confidence bounds or slightly outside, which is
acceptable. Also, according to the Ljung-Box test, groups of autocorrelations up to lag 60
can be considered independent.
Let’s see how much the fit improves if we increase the number of AR terms in the low-order
part from one to two.
m2=Arima(Z1,order=c(2,0,0),seasonal=list(order=c(0,1,1),period=12))
tsdiag(m2,gof.lag=60)
As we can see, the ACF of the residuals for our second model is similar to that of the
previous one, while the Ljung-Box p-values are slightly better in this case.
We now reduce the number of AR terms from two back to one, while increasing the number of
MA terms in the seasonal part from one to two.
m3=Arima(Z1,order=c(1,0,0),seasonal=list(order=c(0,1,2),period=12))
tsdiag(m3,gof.lag=60)
As we can see, there is no clear difference from the other two models, and this one can also
be considered a good fit. The small improvement in fit with each additional parameter is not
justified, so we do not consider any models with more parameters and instead use the
Bayesian information criterion (BIC) and Akaike’s ‘An Information Criterion’ (AIC) to choose
the best model among the ones we estimated.
BIC(m1,m2,m3)
## df BIC
## m1 3 7151.130
## m2 4 7155.929
## m3 4 7157.594
AIC(m1,m2,m3)
## df AIC
## m1 3 7137.355
## m2 4 7137.563
## m3 4 7139.227
From the R output above we see that our first model has the lowest values of both BIC and
AIC, so it can be considered the best fit.
To summarize, the most suitable model (acceptable fit and as few parameters as possible)
we found is ARIMA(1, 0, 0) × (0, 1, 1)12.
## Series: Z1
## ARIMA(1,0,0)(0,1,1)[12]
##
## Coefficients:
## ar1 sma1
## 0.1462 -0.9378
## s.e. 0.0368 0.0190
##
## sigma^2 estimated as 1005: log likelihood=-3565.68
## AIC=7137.36 AICc=7137.39 BIC=7151.13
Since the general form of an ARIMA(1, 0, 0) × (0, 1, 1)12 model is
(1 − φ₁B)(1 − B¹²)Z_t = (1 − Θ₁B¹²)A_t
then from the above R output we get that the mathematical form of our best model is
(1 − 0.1462B)(1 − B¹²)Z_t = (1 − 0.9378B¹²)A_t
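Expanding the polynomials gives an equivalent forecasting recursion, which makes the model
easier to interpret:

Z_t = Z_{t−12} + 0.1462 (Z_{t−1} − Z_{t−13}) + A_t − 0.9378 A_{t−12}

i.e. each month’s forecast is last year’s value for that month, corrected by a small AR
adjustment and by last year’s seasonal forecast error.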
In the figure below we plot the original and fitted values for the whole series. As we can
easily see, some of the extreme values are not fitted well by the chosen model.
The 12 original values that we removed before model fitting, their predictions and the
corresponding 95% prediction intervals from our best model are presented in the following
figure. We can see that our predictions are quite good – all 12 are close to the actual
values, which all lie inside the 95% prediction intervals.
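The predictions referred to here (pred1, used in the RMSE computation below) were presumably
obtained with predict; a sketch:

```r
# 12-step-ahead forecasts from the chosen model, covering the 12
# held-out observations; pred1$pred holds the point forecasts and
# pred1$se the standard errors behind the 95% intervals
pred1 = predict(m1, n.ahead = 12)
```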
The RMSE of our 12 predictions is approximately 18.40249.
(RMSE1=sqrt(mean((actual1-pred1$pred)^2)))
## [1] 18.40249
pre=predict(m1,n.ahead=16)
pre
## $pred
## Jan Feb Mar Apr May Jun Jul
## 2018
## 2019 57.72940 73.77872 117.22715 166.06601 192.51384 193.68482 205.00296
## 2020 57.70790
## Aug Sep Oct Nov Dec
## 2018 115.39571 68.72677 52.00467
## 2019 185.11734 147.94308 108.51549 67.72080 51.85759
## 2020
##
## $se
## Jan Feb Mar Apr May Jun Jul
## 2018
## 2019 32.04146 32.04146 32.04146 32.04146 32.04146 32.04146 32.04146
## 2020 32.10348
## Aug Sep Oct Nov Dec
## 2018 31.69722 32.03423 32.04140
## 2019 32.04146 32.04146 32.10224 32.10354 32.10357
## 2020
We have predicted 16 values, since 12 observations were cut out from the original series,
and thus the last four values are the 4 genuinely new predictions from our model:
Oct - 108.51549
Nov - 67.72080
Dec - 51.85759
Jan - 57.70790
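These four values can be extracted directly from the prediction object:

```r
# last four elements of the forecast vector: Oct 2019 - Jan 2020
tail(pre$pred, 4)
```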
Thus the best model is the multiplicative Holt-Winters model with estimated smoothing
parameters alpha, beta and gamma:
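The model itself is fitted with stats::HoltWinters; the object name HW_mult is the one used
in the code below, while fitting on the same modelling series Z1 is an assumption:

```r
# multiplicative Holt-Winters fit; alpha, beta and gamma are
# estimated by minimising the squared one-step prediction errors
HW_mult = HoltWinters(Z1, seasonal = "multiplicative")
```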
params=data.frame(rbind("alpha","beta","gamma"),rbind(HW_mult$alpha,HW_mult$beta,HW_mult$gamma))
colnames(params)=c("Parameter","Estimate")
params
## Parameter Estimate
## 1 alpha 0.05742941
## 2 beta 0.02320280
## 3 gamma 0.09946926
The 12 original values and their predictions using the best SARIMA model from the last
section and the estimated multiplicative Holt-Winters model are presented in the figure below.
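For a numerical comparison alongside the figure, the Holt-Winters forecasts and their RMSE
can be computed the same way as for the SARIMA model; a sketch using only the objects
HW_mult and actual1 defined earlier:

```r
# 12-step-ahead Holt-Winters forecasts and their RMSE against the
# 12 held-out values, comparable to the SARIMA RMSE above
HW_pred = predict(HW_mult, n.ahead = 12)
sqrt(mean((actual1 - HW_pred)^2))
```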
setwd("C:\\Users\\Robenzo\\OneDrive\\Desktop\\TS")
data2=read.table(file="AG192s_MEAT.csv",skip=4,sep=";",dec=".",header=F,na.strings="..")
## V1 V2 V3
## 2 January 1945
## 3 February 1906
## 4 March 2398
## V1 V2 V3
## 203 July 3464
## 204 August 3274
## 205 September 3471
In the model fitting process, we again leave out the last 12 observations. They will be
used later to assess how good the forecasts of our estimated model are by comparing the
forecasts with the actual known values. In other words, we have monthly data from January
2004 to September 2018 for modelling.
series2=data3[1:(nrow(data3)-12),]
actual2=tail(data3$V3,n=12)
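The series object Z used in the rest of this section was presumably built from the modelling
sample; a sketch, where the start date follows from the January 2004 starting point stated
above:

```r
# monthly time series starting January 2004, without the held-out year
Z = ts(series2$V3, start = c(2004, 1), frequency = 12)
```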
We take a difference of the series to make it stationary, since the original series does not
look stationary.
plots(diff(Z), nlag = 60)
The first model we try to fit is ARIMA(0, 1, 1) × (1, 0, 0)12, i.e. an AR(1) model for the
seasonal part (the partial autocorrelations at the first two seasonal lags are significantly
out of bounds) and an MA(1) model for the low-order part.
m11=Arima(Z,order=c(0,1,1),seasonal=list(order=c(1,0,0),period=12))
tsdiag(m11,gof.lag=60)
Based on the corresponding diagnostic plots this model is a suitable fit – no
autocorrelations are outside the limits or close to them, and all p-values of the Ljung-Box
statistic are considerably above the 0.05 mark, except for a few between lag 20 and lag 30.
Let’s try another model where MA(1) is increased to MA(3) for the low-order part while the
seasonal part remains the same:
ARIMA(0, 1, 3) × (1, 0, 0)12
m22=Arima(Z,order=c(0,1,3),seasonal=list(order=c(1,0,0),period=12))
tsdiag(m22,gof.lag=60)
The p-values corresponding to the Ljung-Box statistic are better in this case than in the
previous one. Adding even more parameters might improve the p-values further but may result
in higher BIC and AIC values.
Comparing the two estimated models using their BIC and AIC values, we see that the first
model has lower values of both and is hence more suitable for our modelling.
BIC(m11,m22)
## df BIC
## m11 3 2359.872
## m22 5 2370.203
AIC(m11,m22)
## df AIC
## m11 3 2350.36
## m22 5 2354.35
m11=Arima(Z,order=c(0,1,1),seasonal=list(order=c(1,0,0),period=12))
m11
## Series: Z
## ARIMA(0,1,1)(1,0,0)[12]
##
## Coefficients:
## ma1 sar1
## -0.6072 0.2652
## s.e. 0.0595 0.0785
##
## sigma^2 estimated as 35818: log likelihood=-1172.18
## AIC=2350.36 AICc=2350.5 BIC=2359.87
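Analogously to the first dataset, the R output above gives the mathematical form of this
model. The general ARIMA(0, 1, 1) × (1, 0, 0)12 form is

(1 − Φ₁B¹²)(1 − B)Z_t = (1 − θ₁B)A_t

and, following the same sign convention used for the first dataset, the estimated
coefficients (ma1 = −0.6072, sar1 = 0.2652) give

(1 − 0.2652B¹²)(1 − B)Z_t = (1 − 0.6072B)A_t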
The RMSE of the 12 predictions from this model is approximately 297.2744.
## [1] 297.2744