
TIME SERIES ANALYSIS PROJECT

PRIYUSH PROTIM SHARMA

08/12/2019

Overview of DATASET-1
The first time series analyzed in this project corresponds to monthly sunshine data at
Heathrow Airport, London, collected from an automatic Kipp & Zonen sensor or otherwise
taken from a Campbell Stokes recorder. The first measurement is from January 1948 and the
final one from September 2019, for a total of 861 measurements (some of which
are missing).
To correctly read the data into R we inspect the text file that contains the data,
heathrowdata.txt, in Notepad++ and observe the following:
• the file contains 7 variables (you can see them in the R output below)
• YYYY - representing the year
• MM - representing the month
• Tmax - representing the daily maximum temperature in degrees Celsius
• Tmin - representing the daily minimum temperature in degrees Celsius
• AF - representing the days of air frost
• Rain - representing the rainfall in mm
• Sun Hours - representing the sunshine hours
• the time series data begins on line 14 (the first 13 rows contain general information
about the data, e.g. exact variable meanings);
• the data fields are aligned in whitespace-separated columns, hence no separator needs to be specified;
• the decimal separator is a period;
• the missing values are denoted with dashes ("---");
• the number of lines that contain time series data is 861;
• the time series data is presented in a column from the oldest to the newest
observation

library(forecast)

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
##   method             from
##   fitted.fracdiff    fracdiff
##   residuals.fracdiff fracdiff

setwd("C:\\Users\\Robenzo\\OneDrive\\Desktop\\Time series Lab\\pro")

data = read.table("heathrowdata.txt", skip = 13, dec = ".", header = F,
                  na.strings = "---")
colnames(data)=c("Year","month","Tmax","Tmin","AF","RAIN","SUN")

head(data,3) # first 3 rows of the data

##   Year month Tmax Tmin AF RAIN SUN
## 1 1948     1  8.9  3.3 NA   85  NA
## 2 1948     2  7.9  2.2 NA   26  NA
## 3 1948     3 14.2  3.8 NA   14  NA

tail(data,3) # last 3 rows of the data

##     Year month Tmax Tmin AF RAIN   SUN
## 859 2019     7 25.5 14.9  0 50.8 194.5
## 860 2019     8 25.2 14.1  0 33.6 201.2
## 861 2019     9 21.2 11.8  0 63.0 156.8

We omit the rows without sunshine data to get a time series without missing
values.
data1 = na.omit(data)

In the model fitting process we leave out the last 12 observations; they will be used
later to assess how good the forecasts of our estimated models are, by comparing the
forecasts with the actual known values. In other words, we have monthly data from January
1957 to September 2018 for modelling.
series1=data1[1:(nrow(data1)-12),]
actual1=tail(data1$SUN,n=12)
actual1

## [1] 137.0 72.9 40.3 56.4 120.2 119.0 170.1 176.3 170.1 194.5 201.2
## [12] 156.8

# Creating the corresponding time series object.

Z1 <- ts(series1$SUN,start=c(1957,1),frequency = 12)


BEST ARIMA/SARIMA MODEL
We start the modelling process by finding the most suitable ARIMA/SARIMA model. The time
series (without the last 12 observations), together with its autocorrelation and
partial autocorrelation plots up to a lag of 5 years, is given in the figure below.
From the figure it is clear that the series is stationary: it fluctuates around a fixed
mean with roughly constant variance.

According to the Phillips-Perron unit root test, we should stick to the original series:
the p-value of 0.01 is less than 0.05, so the null hypothesis of a unit root is rejected.
##
## Phillips-Perron Unit Root Test
##
## data: Z1
## Dickey-Fuller = -11.471, Truncation lag parameter = 6, p-value =
## 0.01
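
The output above matches the format of stats::PP.test, so the test was presumably run as in this sketch:

PP.test(Z1) # H0: the series has a unit root; p-value 0.01 < 0.05, so H0 is rejected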

Based on the autocorrelation plots in the figure above we first consider an MA(1) or MA(2)
model for the seasonal part and an AR(1) model for the low order part. Our initial candidate model is
ARIMA(1, 0, 0) × (0, 1, 1)12.
m1=Arima(Z1,order=c(1,0,0),seasonal=list(order=c(0,1,1),period=12))
tsdiag(m1,gof.lag=60)

The corresponding diagnostic plots are presented in the following figure. From them we
see that a few autocorrelations are close to the confidence bounds or slightly
outside, which is acceptable. Also, according to the Ljung-Box test, groups of autocorrelations up
to lag 60 can be considered independent.
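
tsdiag() carries out this Ljung-Box check over a range of lags; the same check at a single lag can also be done directly with Box.test (a sketch; the lag choice and fitdf = 2, for the two estimated coefficients, are assumptions):

Box.test(residuals(m1), lag = 24, type = "Ljung-Box", fitdf = 2)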
Let's see how much the fit improves if we increase the number of AR terms in the low order
part from one to two.
m2=Arima(Z1,order=c(2,0,0),seasonal=list(order=c(0,1,1),period=12))
tsdiag(m2,gof.lag=60)

As we can see, the ACF of the residuals of our second model is similar to that of the
previous one, while the Ljung-Box p-values are slightly better in this case.
Next we reduce the number of AR terms from two back to one, while increasing the number of
MA terms in the seasonal part from one to two.
m3=Arima(Z1,order=c(1,0,0),seasonal=list(order=c(0,1,2),period=12))
tsdiag(m3,gof.lag=60)
As we can see, there is no clear difference from the other two models, and this one can also be
considered a good fit. The small improvement in fit with each additional parameter is not
justified, so we don't consider any models with more parameters and instead use the Bayesian
information criterion (BIC) and Akaike's information criterion (AIC) to choose the best
model from the ones we initially estimated.
BIC(m1,m2,m3)

## df BIC
## m1 3 7151.130
## m2 4 7155.929
## m3 4 7157.594

AIC(m1,m2,m3)

## df AIC
## m1 3 7137.355
## m2 4 7137.563
## m3 4 7139.227

From the R output above we see that our first model has the lowest BIC and AIC values, so
it can be considered the best fit.
To summarize, the most suitable model (acceptable fit and as few parameters as possible)
we found is ARIMA(1, 0, 0) × (0, 1, 1)12.
## Series: Z1
## ARIMA(1,0,0)(0,1,1)[12]
##
## Coefficients:
## ar1 sma1
## 0.1462 -0.9378
## s.e. 0.0368 0.0190
##
## sigma^2 estimated as 1005: log likelihood=-3565.68
## AIC=7137.36 AICc=7137.39 BIC=7151.13

Since the general form of an ARIMA(1, 0, 0) × (0, 1, 1)12 model is

$$(1 - \phi_1 B)(1 - B^{12}) Z_t = (1 - \Theta_1 B^{12}) A_t,$$

then from the above R output we get that the mathematical form of our best model is

$$(1 - 0.1462\,B)(1 - B^{12}) Z_t = (1 - 0.9378\,B^{12}) A_t.$$
In the figure below we plot the original and fitted values for the whole series. As can
easily be seen, some of the extreme values are not fitted well by the model.
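
A sketch of how such a figure could be drawn (the report's plotting code is not shown; fitted() for Arima objects comes from the forecast package):

plot(Z1, ylab = "Sunshine hours") # original series
lines(fitted(m1), col = "red") # in-sample fitted values from m1
legend("topleft", c("observed", "fitted"), col = c("black", "red"), lty = 1)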

The 12 original values we removed before model fitting, their predictions and the
corresponding 95% prediction intervals from our best model are presented in the following
figure. We can see that our predictions are quite good: all 12 forecasts are close to the
actual values, which all lie inside the 95% prediction intervals.
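
The object pred1 used below is not defined in the code shown; presumably it holds the 12-step forecasts, as in this sketch (the interval construction is an assumption):

pred1 = predict(m1, n.ahead = 12) # point forecasts and standard errors
upper = pred1$pred + 1.96 * pred1$se # approximate 95% prediction interval
lower = pred1$pred - 1.96 * pred1$se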
The RMSE of our 12 predictions is approximately 18.40249.
(RMSE1=sqrt(mean((actual1-pred1$pred)^2)))

## [1] 18.40249

pre=predict(m1,n.ahead=16)
pre

## $pred
## Jan Feb Mar Apr May Jun Jul
## 2018
## 2019 57.72940 73.77872 117.22715 166.06601 192.51384 193.68482 205.00296
## 2020 57.70790
## Aug Sep Oct Nov Dec
## 2018 115.39571 68.72677 52.00467
## 2019 185.11734 147.94308 108.51549 67.72080 51.85759
## 2020
##
## $se
## Jan Feb Mar Apr May Jun Jul
## 2018
## 2019 32.04146 32.04146 32.04146 32.04146 32.04146 32.04146 32.04146
## 2020 32.10348
## Aug Sep Oct Nov Dec
## 2018 31.69722 32.03423 32.04140
## 2019 32.04146 32.04146 32.10224 32.10354 32.10357
## 2020

We have predicted 16 values since 12 observations were cut out from the original series;
the last four are therefore genuinely new forecasts from our model, for October 2019 through January 2020:
Oct - 108.51549
Nov - 67.72080
Dec - 51.85759
Jan - 57.70790

Best model based on exponential smoothing methods


We move on to finding the best model based on exponential smoothing, i.e. using the simple
exponential smoothing method, Holt's method and the Holt-Winters methods (additive and
multiplicative); a sketch of the model fitting follows this paragraph.
From the plots in the figure below it seems that all four methods produce models that fit
the data similarly. The predictions from exponential smoothing and from Holt's model lag
behind the actual series, but otherwise there are no big differences, whereas the
Holt-Winters models show some differences around the peaks (turning points of the series).
Some periodicity seems to be present in all the residual plots, so the prediction errors of
the models cannot be considered random. Graphically, there is no clear best fit.
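
The code that fits the four models is not shown in this extract. A minimal sketch using base R's HoltWinters(), matching the object names used below, might be:

exp_smooth = HoltWinters(Z1, beta = FALSE, gamma = FALSE) # simple exponential smoothing
Holt = HoltWinters(Z1, gamma = FALSE) # Holt's linear trend method
HW_add = HoltWinters(Z1, seasonal = "additive") # additive Holt-Winters
HW_mult = HoltWinters(Z1, seasonal = "multiplicative") # multiplicative Holt-Winters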
We use goodness-of-fit measures to decide which model fits best (a sketch of the function
used to calculate them is given below). From the R output below we see that the
multiplicative Holt-Winters model has the smallest MAD and MAPD, while the additive
Holt-Winters model has a slightly smaller RMSE; since two of the three measures favour it,
we decide to go for the multiplicative model.
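
The GoodnessOfFit() helper itself is not reproduced here; a sketch consistent with the three measures reported (assuming MAPD is computed as the sum of absolute errors divided by the sum of the observed values):

GoodnessOfFit = function(fit) {
  xhat = fitted(fit)[, "xhat"] # one-step-ahead fitted values
  x = window(fit$x, start = start(xhat)) # observations over the fitted span
  e = x - xhat # in-sample prediction errors
  c(MAD = mean(abs(e)),
    RMSE = sqrt(mean(e^2)),
    MAPD = sum(abs(e)) / sum(x))
}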
models=c("Exp. smoothing","Holt","Add. Holt-Winters","Mult. Holt-Winters")
gofs=rbind(GoodnessOfFit(exp_smooth),GoodnessOfFit(Holt),
GoodnessOfFit(HW_add),GoodnessOfFit(HW_mult))
data.frame("Model"=models,round(gofs,6))

##                Model      MAD     RMSE     MAPD
## 1     Exp. smoothing 39.21715 50.57436 0.360983
## 2               Holt 39.68343 50.90699 0.371951
## 3  Add. Holt-Winters 25.31470 33.10724 0.228982
## 4 Mult. Holt-Winters 25.10157 33.21498 0.222515

Thus the best model is the multiplicative Holt-Winters, with estimated smoothing parameters
alpha, beta and gamma:
params=data.frame(rbind("alpha","beta","gamma"),
                  rbind(HW_mult$alpha,HW_mult$beta,HW_mult$gamma))
colnames(params)=c("Parameter","Estimate")
params

## Parameter Estimate
## 1 alpha 0.05742941
## 2 beta 0.02320280
## 3 gamma 0.09946926
The 12 original values and their predictions, from both the best SARIMA model of the
previous section and the estimated multiplicative Holt-Winters model, are presented in the figure below.

The RMSE of the 12 predictions from the multiplicative Holt-Winters model is approximately
19.70565 (see the R output below), compared to approximately 18.40249 for the best SARIMA
model (recall the previous section). Thus the SARIMA model is the better forecasting model,
which we also observed in the figure.
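The code producing this RMSE is not shown; a sketch under the assumption that predict.HoltWinters was used:

predHW = predict(HW_mult, n.ahead = 12) # forecasts for the 12 held-out months
sqrt(mean((actual1 - predHW)^2))
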
## [1] 19.70565
Overview of DATASET-2
The second time series analyzed in this project corresponds to meat purchases. The data
run from January 2004 to September 2019.
To correctly read the data into R we inspect the CSV file that contains the data,
AG192s_MEAT.csv, in Notepad++ and observe the following:
• The time series data begins on line 5
• The data fields are separated by semicolons (;)
• The missing values are denoted with two periods (..)
• The time series data is presented in a column from the oldest to the newest
observation

setwd("C:\\Users\\Robenzo\\OneDrive\\Desktop\\TS")
data2=read.table(file="AG192s_MEAT.csv",skip=4,sep=";",dec=".",header=F,
                 na.strings="..")

# Removing the rows without useful data
data3 = na.omit(data2)

head(data3,3) #first 3 values of the data

## V1 V2 V3
## 2 January 1945
## 3 February 1906
## 4 March 2398

tail(data3,3) #last 3 values of the data

## V1 V2 V3
## 203 July 3464
## 204 August 3274
## 205 September 3471

In the model fitting process we again leave out the last 12 observations; they will be
used later to assess how good the forecasts of our estimated model are, by comparing them
with the actual known values. In other words, we have monthly data from January
2004 to September 2018 for modelling.
series2=data3[1:(nrow(data3)-12),]
actual2=tail(data3$V3,n=12)

# Creating the corresponding time series object
# (note: no start year is given, so the internal time axis begins at 1)
Z = ts(series2$V3,frequency = 12)
BEST ARIMA/SARIMA MODEL
With this series we also begin by finding the most suitable ARIMA/SARIMA model. The time
series (without the last 12 observations), together with its autocorrelation and partial
autocorrelation plots up to a lag of five years, is presented in the figure below.

The series does not look stationary, so we take a first difference before examining its
correlation structure.
plots(diff(Z), nlag = 60) # plots() is a course helper (not shown); forecast::tsdisplay(diff(Z), lag.max = 60) gives a similar display
The first model we try to fit is ARIMA(0, 1, 1) × (1, 0, 0)12, i.e. an AR(1) model for the
seasonal part (the partial autocorrelations at the first two seasonal lags are significantly
out of bounds) and an MA(1) model for the low order part.
m11=Arima(Z,order=c(0,1,1),seasonal=list(order=c(1,0,0),period=12))
tsdiag(m11,gof.lag=60)
Based on the corresponding diagnostic plots this model is a suitable fit: no residual
autocorrelations beyond lag zero are outside the limits or close to them, and the p-values
of the Ljung-Box statistic are considerably above the 0.05 mark, except for a few lags
between 20 and 30.
Let's try another model where the MA(1) term is increased to MA(3) in the low order part
while the seasonal part remains the same:
ARIMA(0, 1, 3) × (1, 0, 0)12
m22=Arima(Z,order=c(0,1,3),seasonal=list(order=c(1,0,0),period=12))
tsdiag(m22,gof.lag=60)
The p-values of the Ljung-Box statistic are better in this case than in the previous one.
Adding even more parameters might improve the p-values further but would likely result in
higher BIC and AIC values.
Comparing the two estimated models using their BIC and AIC values, we see that the first
model has the lower value of both and is hence the more suitable choice.
BIC(m11,m22)

## df BIC
## m11 3 2359.872
## m22 5 2370.203

AIC(m11,m22)
## df AIC
## m11 3 2350.36
## m22 5 2354.35

m11=Arima(Z,order=c(0,1,1),seasonal=list(order=c(1,0,0),period=12))
m11

## Series: Z
## ARIMA(0,1,1)(1,0,0)[12]
##
## Coefficients:
## ma1 sar1
## -0.6072 0.2652
## s.e. 0.0595 0.0785
##
## sigma^2 estimated as 35818: log likelihood=-1172.18
## AIC=2350.36 AICc=2350.5 BIC=2359.87

Since the general form of an ARIMA(0, 1, 1) × (1, 0, 0)12 model is

$$(1 - \Phi_1 B^{12})(1 - B) Z_t = (1 - \theta_1 B) A_t,$$

then from the R output above the mathematical form of our best model is

$$(1 - 0.2652\,B^{12})(1 - B) Z_t = (1 - 0.6072\,B) A_t.$$


The 12 original values we removed before model fitting, their predictions and the
corresponding 95% prediction intervals from our best model are given in the figure below.
We can see that most of our predictions are quite good, yet some extreme values are not
predicted well.
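
As with the first dataset, the object pred2 used below is not defined in the code shown; presumably it comes from predict():

pred2 = predict(m11, n.ahead = 12) # point forecasts for the 12 held-out months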

The RMSE of our 12 predictions is 297.2744.

(RMSE2=sqrt(mean((actual2-pred2$pred)^2)))

## [1] 297.2744
