Forecasting
Samuel Chan
Updated: January 31, 2019
Contents
Background
    Algoritma
    Libraries and Setup
    Training Objectives
Graded Quizzes
Learn-by-Building
Extra Content
    Using Prophet
Annotations
Before you go ahead and run the code in this coursebook, it's often a good idea to go through some initial
setup. Under the Libraries and Setup tab you'll see some code to initialize our workspace and the libraries
we'll be using for the projects. You may want to make sure that the libraries are installed beforehand by
referring back to the packages listed here. Under the Training Objectives tab we'll outline the syllabus, identify
the key objectives and set up expectations for each module.
Background
Algoritma
The following coursebook is produced by the team at Algoritma for its Data Science Academy workshops. The
coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received
this coursebook directly from the training organization. It may not be reproduced, distributed, translated or
adapted in any form outside these individuals and organizations without permission.
Algoritma is a data science education center based in Jakarta. We organize workshops and training programs
to help working professionals and students gain mastery in various data science sub-fields: data visualization,
machine learning, data modeling, statistical inference etc.
Libraries and Setup
We'll set up caching for this notebook, given how computationally expensive some of the code we will write
can get.
knitr::opts_chunk$set(cache=TRUE)
options(scipen = 9999)
You will need to use install.packages() to install any packages that are not already downloaded onto your
machine. You then load the package into your workspace using the library() function:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forecast)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(TTR)
library(fpp)
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
Training Objectives
Decomposition of time series allows us to learn about the underlying seasonality, trend and random fluctuations
in a systematic fashion. In the next 9 hours, we’ll learn the methods to account for seasonality and trend,
work with autocorrelation models and create industry-scale forecasts using modern tools and frameworks.
• Working with Time Series
• Log-Transformation
• Decomposition
• Two-sided SMA
• Understanding Lags
• Plotting Forecasts
• Multiple-Seasonality
• SSE & Standard Errors
• General Election 2008: Voter turnout is 76.0% (8.1 million cast their votes)
1 T. Raicharoen, C. Lursinsap, P. Sanguanbhoki, Application of critical support vector machine to time series prediction
2 G. E. P. Box, G. Jenkins, Time Series Analysis, Forecasting and Control
3 G. P. Zhang, Time Series forecasting using a hybrid ARIMA and neural network model
Gathering only the above data, what would you predict the turnout size to be in the subsequent election
(2013)?
Our forecast is likely going to be a poor estimate of the succeeding observation. In 2013, 11.3 million turned
up to cast their votes - a big deviation from the past, where we observed increments of +0.5 to +0.9
million. In 2018 (the most recent General Election, which had concluded less than a week before this was written),
15 million cast their votes. In both 2013 and 2018, more than 82% showed up to vote - again a
significant deviation from past observations.
Time series data doesn't just apply to political campaigns - common application areas of time series include:
• finance: technical trading strategies using daily share price fluctuations, or tracking a currency's daily exchange rate, etc.
• marketing: predicting global demand for a beverage using aggregated monthly sales data
• economics: monthly statistics on unemployment, personal income tax, government expenditure records
• socio-environmental: periodic records on hospital admissions, rainfall, air quality readings, seasonal influenza-associated deaths, energy demand forecasting
• science: EEG brain wave activity sampled every 2^-8 (~0.004) seconds
Once you've identified an opportunity for forecasting (where past data is in fact a good indication of what may
lie ahead), R provides a very wide set of tools for working with time series, and that shall be
the focus of this workshop.
Time Series in R
R has a rich set of functionality for analyzing and working with time series data, and its open source community
has invested substantial effort into giving it the infrastructure for representing, visualizing and forecasting
time series data. The "class" we'll be working with is ts (short for "time series"), and an object of class
ts is base R's built-in way of storing and representing regularly spaced time series. It is a useful class for
storing and working with annual, monthly, quarterly, daily, hourly or even more frequently sampled data.
The documentation of this function reads that a time series object represents data which has been sampled at
equispaced points in time 4 .
When we convert data into a time series object, we need to specify a value for the frequency parameter.
This value indicates the number of observations in a natural period. Examples:
• We would use a value of 7 for frequency when the data are sampled daily and the natural time period
we’d like to consider is a week
• We would use a value of 12 when the data are sampled monthly and the natural time period is a year (see the short sketch below)
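For instance, here is a minimal sketch (using made-up numbers, not coursebook data) of converting a plain numeric vector into a monthly ts object:
# 24 made-up monthly values treated as a monthly series starting January 2015
x <- round(rnorm(24, mean = 100, sd = 5), 1)
x_ts <- ts(x, start = c(2015, 1), frequency = 12)
x_ts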
Dive Deeper: What would you use as a reasonable value for frequency if you are working with daily data
and the natural time period is a year?
Let's take a look at how we can convert a generic data frame column into a univariate time series object.
We'll read in some data which I've prepared in your directory. This is a dataset 5 that consists of 6 attributes
representing various gas emissions contributed to Indonesia's atmosphere.
The data records the amount of these emissions from 1970 until the end of 2012. We'll take
a look at the structure of the data we read in:
co2 <- read.csv("data_input/environment_1970f.csv")
str(co2)
4 R Documentation, Time Series Object
5 World Development Indicators, The World Bank DataBank. Retrieved 30/6/2017
## 'data.frame': 43 obs. of 7 variables:
## $ year : int 19
## $ CO2.emissions..kt. : num 35
## $ CO2.emissions..metric.tons.per.capita. : num 0.
## $ Methane.emissions..kt.of.CO2.equivalent. : num 12
## $ Nitrous.oxide.emissions..thousand.metric.tons.of.CO2.equivalent. : num 51
## $ Other.greenhouse.gas.emissions..HFC..PFC.and.SF6..thousand.metric.tons.of.CO2.equivalent.: num 13
## $ Total.greenhouse.gas.emissions..kt.of.CO2.equivalent. : num 33
The 2nd to 7th variables indicate respectively:
1. CO2 emissions (kt):
Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of
cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and
gas flaring.
2. CO2 emissions (metric tons per capita):
Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of
cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and
gas flaring.
3. Methane emissions (kt of CO2 equivalent):
Methane emissions are those stemming from human activities such as agriculture and from industrial
methane production.
4. Nitrous oxide emissions (thousand metric tons of CO2 equivalent):
Nitrous oxide emissions are emissions from agricultural biomass burning, industrial activities, and
livestock management.
5. Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent):
Other greenhouse gas emissions are by-product emissions of hydrofluorocarbons, perfluorocarbons, and
sulfur hexafluoride.
6. Total greenhouse gas emissions (kt of CO2 equivalent)
Total greenhouse gas emissions in kt of CO2 equivalent are composed of CO2 totals excluding short-
cycle biomass burning (such as agricultural waste burning and savanna burning) but including other
biomass burning (such as forest fires, post-burn decay, peat fires and decay of drained peatlands), all
anthropogenic CH4 sources, N2O sources and F-gases (HFCs, PFCs and SF6).
The data is yearly observations between 1970 and 2012:
range(co2$year)
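The time series object itself can be created with ts(); a minimal version consistent with the output shown below (per-capita CO2 emissions, start 1970, frequency 1) is:
# yearly data: one observation per natural period, hence frequency = 1
co2_ts <- ts(co2$CO2.emissions..metric.tons.per.capita., start = 1970, frequency = 1)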
If you take a look at the co2_ts object we just created, you will see that it is now a Time Series object
storing the values from $CO2.emissions..metric.tons.per.capita.:
co2_ts
## Time Series:
## Start = 1970
## End = 2012
## Frequency = 1
## [1] 0.3119519 0.3306215 0.3580080 0.3954703 0.4021567 0.4128050 0.4612390
## [8] 0.6002978 0.6677802 0.6601457 0.6426495 0.6634071 0.6822242 0.6640976
## [15] 0.6944021 0.7347680 0.7229173 0.7184145 0.7552095 0.7348064 0.8243417
## [22] 0.9735380 1.0788747 1.1452296 1.1416286 1.1420774 1.2669930 1.3738789
## [29] 1.0218517 1.1599925 1.2452416 1.3748183 1.4102338 1.4364045 1.5098982
## [36] 1.5084806 1.5015768 1.6118554 1.7638951 1.8651654 1.7679079 2.3335854
## [43] 2.4089201
We can use plot on our time series:
plot(co2_ts)
[Figure: plot of co2_ts - CO2 emissions (metric tons per capita), 1970-2012]
We can subset a time series using the window() function and specify the start and end for the time window
to be selected:
# subset the time series (1980 to 2000)
plot(window(co2_ts, start=1980, end=2000))
[Figure: window(co2_ts, start = 1980, end = 2000)]
What we just observed above is a time series with a clear trend but without any clear indication of a cyclic
pattern - time series like this are generally less interesting because there are no repeated observations for
each period, making it impossible to study any underlying seasonal pattern.
In other words, you can't decompose the above time series any further: for each year there is exactly one
observation (frequency = 1), and you can't extract any pattern of seasonality from a time series whose
frequency is 1.
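As a quick check (a small sketch, not from the original coursebook), decompose() itself will refuse such a series, since it needs a frequency greater than 1 and at least two full periods:
# co2_ts has frequency = 1, so classical decomposition is not possible
try(decompose(co2_ts))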
Dive Deeper: presidents is a built-in dataset recording the quarterly approval ratings of US Presidents.
1. What is the class of presidents? If it is not a time series object, convert it to a time series object.
2. Plot the time series.
3. 2 hours and 8 minutes after John F. Kennedy's assassination, amid suspicions of a conspiracy against
the government, Lyndon Johnson was sworn in as president. His presidential term ran from the last
quarter of 1963 (hint: start=c(1963,4)) to the last quarter of 1968. Create a time series plot for that
time window using the appropriate start and end parameters for the window() function.
4. Was Lyndon Johnson a popular president during his presidential term?
5. As a bonus exercise, give the plot an appropriate main title and axis labels (x and y).
Reference Answer: Lyndon Johnson is arguably one of the least popular presidents in US history, with backlash
stemming from "Frustration over Vietnam; too much federal spending and. . . taxation; no great public
support for your Great Society programs; and . . . public disenchantment with the civil rights programs".
Famously, he "could scarcely travel anywhere without facing protests" as people chanted "Hey,
hey, LBJ, how many kids did you kill today?" 6
6 Lyndon Johnson left office as a deeply unpopular president. So why is he so admired today?
plot(window(presidents, start=c(1963,4), end=c(1968,4)))
[Figure: window(presidents, start = c(1963, 4), end = c(1968, 4)) - quarterly approval ratings, roughly 40 to 80]
Let’s take a look at the following data instead. The following is monthly data starting from January 1946
corresponding to the number of births (thousands) per month in New York City:
births <- read.csv("data_input/nybirth.csv")
str(births)
# (assumes the series has already been drawn, e.g. plot(births$date, births$births, type = "l", xaxt = "n"))
points(
  x = births[births$month == 2, "date"],
  y = births[births$month == 2, "births"], col="darkred", pch=19)
points(
  x = births[births$month == 7, "date"],
  y = births[births$month == 7, "births"], col="dodgerblue4", pch=19)
# tick labels reconstructed to match the figure ("Feb 46", "Feb 47", ...)
axis(1, births$date[seq(2, length(births$date), by = 12)],
     format(births$date[seq(2, length(births$date), by = 12)], "%b %y"))
[Figure: births$births plotted against births$date - monthly NYC births with February (dark red) and July (blue) points highlighted; x-axis labelled Feb 46 through Feb 59]
[Figure: plot of births_ts - monthly births (thousands), roughly 20 to 30]
• Residuals (Et): irregular components or random fluctuations not captured by the trend and seasonal components.
The idea of classical decomposition is to create separate models for each of the three elements, with the
aim of describing our series either additively:
Xt = Tt + St + Et
or multiplicatively:
Xt = Tt · St · Et
When we use the decompose function in R, we’re performing the classical seasonal decomposition using
moving averages (a concept we’ll get into). We can optionally specify a type of the seasonal component by
using the type parameter. By default, it assumes an additive model but this can be changed: decompose(x,
type="multiplicative")
str(decompose(births_ts))
## List of 6
## $ x : Time-Series [1:168] from 1946 to 1960: 26.7 23.6 26.9 24.7 25.8 ...
## $ seasonal: Time-Series [1:168] from 1946 to 1960: -0.677 -2.083 0.863 -0.802 0.252 ...
## $ trend : Time-Series [1:168] from 1946 to 1960: NA NA NA NA NA ...
## $ random : Time-Series [1:168] from 1946 to 1960: NA NA NA NA NA ...
## $ figure : num [1:12] -0.677 -2.083 0.863 -0.802 0.252 ...
## $ type : chr "additive"
## - attr(*, "class")= chr "decomposed.ts"
Observe the plot of your time series again - an additive model seems a good fit for the time series, since the
seasonal variation appears to be rather constant across the observed period (x-axis). Note that a series with
multiplicative effects can often be transformed into one with additive effects through a log transformation, a
technique we've discussed in your Classification 1 modules.
Dive Deeper: Monthly Airline Passengers, 1949-1960
1. Look at the following plot of an airline passenger time series. Recall that an additive model is appropriate if the seasonal and random variation is
constant. Do you think this time series is better described using an additive or a multiplicative model?
data("AirPassengers")
plot(AirPassengers)
[Figure: plot of AirPassengers - monthly passengers, roughly 100 to 600]
2. Is it possible to transform the air passengers data such that it can be described using an additive model?
# your code below:
Tip (Extra Intuition): If you needed help with (2) above, recall the log transformation we learned about
in the Logistic Regression chapter of the Classification 1 workshop. A series like the following looks
multiplicative (each value is roughly 4.5 times the previous one). However, wrap it in
log() and you'll transform the series into an additive one:
c(7.389056, 33.115451, 148.413159, 665.141633, 2980.957987)
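Wrapping that vector in log() (a quick check you can run yourself) turns the roughly 4.5-fold jumps into constant steps of about 1.5 - exactly the additive behaviour we want:
x <- c(7.389056, 33.115451, 148.413159, 665.141633, 2980.957987)
log(x)        # 2.0, 3.5, 5.0, 6.5, 8.0
diff(log(x))  # constant increments of ~1.5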
This agrees with the definition of the decomposition process as well as our "visual" inspection earlier: in
February, a maximum loss of -2.08 is applied to the additive equation, while in July we see a maximum gain
of 1.45, corresponding to the dips and peaks in these two months:
births_dc <- decompose(births_ts)
births_dc$seasonal
• The second panel plots the trend component, estimated using a moving average with a symmetric window
with equal weights. This means that for each point in the series, a new value is estimated by taking
the average of the point itself ±6 points (since frequency = 12).
• The third panel plots the seasonal component, with the figure being computed by taking the average
for each time unit over all periods and then centering it around the mean.
• The bottom-most panel plots the error component, which is determined by removing the trend and seasonal
figures.
plot(births_dc)
[Figure: plot of births_dc - observed, trend, seasonal and random components]
[Figure: plot of births_sma - smoothed births series, roughly 22 to 28]
So to solidify our understanding of the classical decomposition process, let's study the mathematical details
behind decompose() by comparing a manual calculation to the one we obtain from decompose().
When we estimate the trend component from a time series by using a moving average, recall that the
specification is a symmetric window with equal weights. For a time series with a frequency of 12,
the weight assigned to each observation within that window would be:
c(0.5, rep_len(1, 12-1), 0.5)/12
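To calculate the centered moving average of observation #7, we apply these weights to observations 1 through 13; a version that mirrors the observation-8 calculation further below (and whose result matches the printed 23.98433) is:
# weight 1/24 on the two end points, 1/12 on the eleven points in between
sum(
  as.vector(births_ts[1])*0.04166667,
  as.vector(births_ts[2:12])*0.08333333,
  as.vector(births_ts[13])*0.04166667
)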
## [1] 23.98433
To calculate the moving average of observation #8, we move the considered window by one unit of time to
the right and then apply the weight similarly to the above:
sum(
as.vector(births_ts[2])*0.04166667,
as.vector(births_ts[3:13])*0.08333333,
as.vector(births_ts[14])*0.04166667
)
## [1] 23.66212
If that seems tedious, the good news is that we can make use of the filter() function, which by default
applies our weights / coefficients to both sides of the point. Observe that the results for observations 7 and 8
are equivalent to our manual calculations above:
coef1 <- c(0.5, rep_len(1, frequency(births_ts)-1), 0.5)/frequency(births_ts)
trn <- stats::filter(births_ts, coef1)
trn
[Figure: plot of trn - the centered moving average trend estimate, roughly 22 to 27]
Compare the figures you’ve arrived at manually to the trend component output from the decompose()
function. They are the same.
Now that we know how to arrive manually at the values of our Trend component, let’s see how we arrive at
the Seasonal component of the time series. Recall that the frequency of our time series is 12, and we will use
that to create a sequence consisting of the multiples of 12. Informally, we are going to take the average of the
values in this sequence later, add 1 to each value in the sequence and again record the average, then add 1
again - and repeat this 12 times so we end up with a series of length 12, each being the mean of the values in
each iteration.
f <- frequency(births_ts)
index <- seq.int(1L, length(births_ts), by = f) - 1L
index
# detrend the series and set up the vector of seasonal figures
detrend <- births_ts - trn
figure <- numeric(f)
for (i in 1L:f)
  figure[i] <- mean(detrend[index + i], na.rm = TRUE)
# if multiplicative: figure / mean(figure)
figure <- figure - mean(figure)
Plot the time series. Can you tell if an additive model is appropriate for this time series?
Write your solution for (Q3) below:
souvenir <- scan("data_input/fancy.dat", quiet = T)
souvenir_ts <- ts(souvenir, frequency = 12, start = c(1987,1))
plot.ts(souvenir_ts)
[Figure: plot.ts(souvenir_ts) - monthly souvenir sales, roughly 0 to 100,000]
In this case it appears an additive model is not appropriate for describing this time series, since both the
size of the seasonal fluctuations and the size of the random fluctuations increase with the
level of the time series. Thus, we may need to transform the time series into one
that can be described using an additive model:
souvenir_lts <- log(souvenir_ts)
plot.ts(souvenir_lts)
[Figure: plot.ts(souvenir_lts) - log of monthly souvenir sales, roughly 8 to 11]
Now observe that the size of the seasonal fluctuations and random fluctuations in the log-transformed
time series is roughly constant and no longer depends on the level of the time series. This
log-transformed time series can be better described by an additive model. We still observe
seasonal fluctuations in our data due to the seasonal components, but what if we observe the sales trend after
adjusting for the seasonal effect? Would it still be generally upward-trending, or is our optimism
unwarranted?
The following code first decomposes the log-transformed time series, and then removes the seasonal component
from it. We can see that, having adjusted for seasonality, our sales still show an upward
trend:
souvenir_dc <- decompose(souvenir_lts)
souvenir_sadj <- souvenir_lts - souvenir_dc$seasonal
plot(souvenir_sadj)
[Figure: plot of souvenir_sadj - seasonally adjusted log sales, roughly 8 to 10]
So far we’ve been working with time series data with a seasonality component. In the following sub-section,
let’s take a look at how we can work with non-seasonal data.
[Figure: plot of co2_tsm - co2_ts smoothed with a simple moving average of order 3]
There still appear to be some random fluctuations with a simple moving average of order 3. To estimate the
trend component more accurately, we may want to use smoothing with a higher order:
co2_tsm <- SMA(co2_ts, n=5)
plot(co2_tsm)
[Figure: plot of co2_tsm - co2_ts smoothed with a simple moving average of order 5]
The data smoothed with a simple moving average of order 5 gives a clearer picture of the trend component,
and we can see that the level of CO2 emissions contributed to Indonesia's atmosphere has risen rather sharply
in the more recent years (2005-2012). The increase was much sharper than in the past, say in the 20 years
between 1970 and 1990.
As a reminder, recall that SMA calculates the arithmetic mean of the series over the past n observations. This
is essentially a one-sided moving average and can be represented in the following form:
z_t = (1 / (k + 1)) * sum_{j=0}^{k} y_{t-j}
Compare this to a two-sided moving average, which takes the form:
z_t = (1 / (2k + 1)) * sum_{j=-k}^{k} y_{t+j}
Where k is the number of observations to consider in your moving average window and t is the observation
we're replacing with a moving average.
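As a quick side-by-side sketch (my own illustration, not code from the original coursebook), TTR::SMA() gives the one-sided (trailing) average while stats::filter() with equal weights and sides = 2 gives the two-sided (centered) version:
k <- 2
one_sided <- TTR::SMA(co2_ts, n = k + 1)   # mean of the current and previous k points
two_sided <- stats::filter(co2_ts, rep(1/(2*k + 1), 2*k + 1), sides = 2)   # mean of the point and k points either side
head(cbind(one_sided, two_sided), 8)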
## [1] 2.027895
## [1] 2.099375
And if we plot our observations along with the simple moving average (plus 3 future points), then we
have a forecast model (even if a simple one!):
plot(co2$CO2.emissions..metric.tons.per.capita., type="l")
co2_sma <- SMA(co2_ts, n=5)
lines(c(co2_sma, future1, future2, future3), col="dodgerblue4", lty=2)
[Figure: co2$CO2.emissions..metric.tons.per.capita. plotted against its index, with the simple moving average and the 3 future points overlaid as a dashed blue line]
Using the mean of all past observations, hence the full dataset, will give us the best estimate of a future value.
Naturally, a longer observation period will average out the fluctuations and reduce the effect of variability.
Why do we assume that the changes are non-systematic? Because if there were any systematic changes in our
mean, they would have been attributed to the seasonal or trend component and we wouldn't have flatly used
the mean as a forecast.
Another reasonable strategy you may propose is to weight our observations such that recent observations have
a stronger say in determining the future value than observations in the distant past. This brings us to the
exponential smoothing method of forecasting in the next chapter.
[Figure: plot of the rainfall series (rain_ts) - annual rainfall in inches, mean around 25, over roughly 100 years]
8 Duke University, Statistical forecasting: notes on regression and time series analysis
Observe that the mean hovers around 25 inches rather constantly throughout the 100 years and the random
fluctuations also seem to be roughly constant in size, so it’s probably appropriate to describe the data using
an additive model. We’ve seen how a SMA would have worked on the above time series but here we’ll see if a
simple exponential smoothing algorithm is up to the task.
Comparison to SMA
Conceptually, where a simple moving average assigns equal weight to the n most recent observations, exponential
smoothing assigns exponentially decreasing weights as the observations get older and commonly considers the
full data. Through the assignment of weights, we can then make our forecast more or less influenced by
recent observations as compared to older observations.
In most cases an alpha parameter smaller than 0.40 is often effective. However, one may perform a grid
search of the parameter space, from alpha = 0.1 to alpha = 0.9 in increments of 0.1. The best alpha is then the one with the
smallest mean absolute error (MAE).
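To build some intuition for alpha (a small illustration, not from the original material), you can print the implicit weights that simple exponential smoothing places on the most recent observations; a small alpha spreads the weight over many past points, while a large alpha concentrates it on the latest ones:
round(0.2 * (1 - 0.2)^(0:5), 3)  # alpha = 0.2: weights on y_t, y_t-1, ..., y_t-5 decay slowly
round(0.8 * (1 - 0.8)^(0:5), 3)  # alpha = 0.8: almost all weight on the most recent observations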
The function I use here is ets(), to which we pass an optional model specification and an alpha value of 0.2.
The model is specified through a three-character string, where:
- the first letter denotes the error type ("A", "M" or "Z")
- the second letter denotes the trend type ("N", "A", "M" or "Z")
- the third letter denotes the seasonality ("N", "A", "M" or "Z")
- "N" indicates none, "A" additive, "M" multiplicative and "Z" automatic selection
Let’s create an exponential smoothing that only consider the error ignoring any trend and seasonality in our
time series:
library(forecast)
co2_ets <- ets(co2_ts, model="ANN", alpha=0.2)
co2_ets$fitted
## Time Series:
## Start = 1970
## End = 2012
## Frequency = 1
## [1] 0.4290483 0.4056290 0.3906275 0.3841036 0.3863769 0.3895329 0.3941873
## [8] 0.4075976 0.4461377 0.4904662 0.5244021 0.5480516 0.5711227 0.5933430
## [15] 0.6074939 0.6248755 0.6468540 0.6620667 0.6733363 0.6897109 0.6987300
## [22] 0.7238523 0.7737895 0.8348065 0.8968911 0.9458386 0.9850864 1.0414677
## [29] 1.1079499 1.0907303 1.1045827 1.1327145 1.1811353 1.2269550 1.2688449
## [36] 1.3170555 1.3553406 1.3845878 1.4300413 1.4968121 1.5704827 1.6099678
## [43] 1.7546913
co2_ets$residual
## Time Series:
## Start = 1970
## End = 2012
## Frequency = 1
## [1] -0.11709643 -0.07500754 -0.03261950 0.01136666 0.01577974
## [6] 0.02327215 0.06705166 0.19270012 0.22164253 0.16967954
## [11] 0.11824747 0.11535555 0.11110153 0.07075465 0.08690815
## [16] 0.10989249 0.07606323 0.05634783 0.08187320 0.04509546
## [21] 0.12561166 0.24968568 0.30508525 0.31042305 0.24473744
## [26] 0.19623883 0.28190663 0.33241119 -0.08609825 0.06926220
## [31] 0.14065889 0.24210384 0.22909853 0.20944949 0.24105334
## [36] 0.19142506 0.14623622 0.22726760 0.33385382 0.36835332
## [41] 0.19742513 0.72361760 0.65422886
Observe that we can confirm the formula Ft+1 = Ft + αEt :
co2_ets$fitted[1] + 0.2 * co2_ets$residual[1]
## [1] 0.405629
co2_ets$fitted[2] + 0.2 * co2_ets$residual[2]
## [1] 0.3906275
co2_ets$fitted[3] + 0.2 * co2_ets$residual[3]
## [1] 0.3841036
co2_ets$fitted[2:4]
Let’s plot the time series along with our forecast to see whether the above model adequately describes the
variation in our time series (it wouldn’t because we use “N” to denote the trend component, hence the model
has not considered a trend component yet):
plot(co2_ts)
lines(co2_ets$fitted, lty=2, col="dodgerblue4")
[Figure: plot of co2_ts with co2_ets$fitted overlaid as a dashed blue line]
It seems like our earlier suspicion was right - the model didn't adequately "capture" the trend in our data.
Discussion: It looks like model="ANN" isn't the right model specification for our time series. There exists an
additive trend component that wasn't considered. How would you change the following code?
co2_ets<- ets(____, model="_____", alpha=0.2)
To see if your code works, plot the time series along with your forecasting line and compare to the original plot
(the one that systematically under-predicts the value of CO2 in Indonesia). Is your model an improvement?
### plot your new exponential smoothing model below and compare to the plot we created above
Reference answer: The forecast under-predicted most of the value because the trend component in our time
series wasn’t considered. Let’s change the model specification by including an additive trend type:
co2_ettrend <- ets(co2_ts, model="AAN", alpha=0.2)
plot(co2_ts)
lines(co2_ettrend$fitted, lty=2, col="dodgerblue4")
[Figure: plot of co2_ts with co2_ettrend$fitted (AAN model) overlaid as a dashed blue line]
Another way we can fit a simple exponential smoothing predictive model is with R's built-in HoltWinters()
function, setting FALSE for the beta and gamma parameters in the function call. The beta and gamma
parameters are used for Holt's exponential smoothing and Holt-Winters exponential smoothing, which we will
discuss in greater detail in later sections of the coursebook. For now, let's call HoltWinters() on our
CO2 emission time series.
The HoltWinters() function will return a list variable that contains several named elements, which we will
discuss:
co2_hw <- HoltWinters(co2_ts, beta=F, gamma=F)
co2_hw
Since we have prior knowledge that our time series does in fact contain a trend component, if you remove
beta=F from the function call and run the plot below again, you'll see that we obtain a slightly better estimate.
For now, understand that beta represents the coefficient for the trend component in our time series -
we'll look at this with greater attention later.
plot(co2_hw, main="CO2 Emissions (ton) per capita, Indonesia")
[Figure: CO2 Emissions (ton) per capita, Indonesia - Holt-Winters filtering of co2_hw, observed and fitted values]
I’ve created another exponential smoothing model on rain_ts, and store it under the variable rain_hw.
Print rain_hw and compare it to the output from printing co2_hw.
rain_hw <- HoltWinters(rain_ts, beta=F, gamma=F)
Discussion: Compare the output, paying special attention to the estimated coefficient of the alpha parameter.
Recall from earlier sections that:
If α is small (i.e. close to 0), more weight is given to observations from the more distant past. If
α is large (i.e. close to 1), more weight is given to the more recent observations.
Between the CO2 emission and rainfall time series, which has an alpha that is closer to 0? An alpha close to 0
indicates that the forecasts are based on both recent and less recent observations, while an alpha
close to 1 indicates that more weight is given to the more recent observations.
Reference answer:
co2_hw$alpha
## [1] 0.9999546
rain_hw$alpha
## [1] 0.02412151
By default, HoltWinters() performs its smoothing for the time period covered by our original time
series. Our original time series for Indonesia's CO2 emissions runs from 1970 to 2012, so the fitted
(one-step-ahead) forecasts cover 1971 to 2012 - the first observation has nothing before it to smooth from.
The forecasts made by HoltWinters() are stored in a named element called "fitted":
head(co2_hw$fitted, 10)
## Time Series:
## Start = 1971
## End = 1980
## Frequency = 1
## xhat level
## 1971 0.3119519 0.3119519
## 1972 0.3306206 0.3306206
## 1973 0.3580067 0.3580067
## 1974 0.3954686 0.3954686
## 1975 0.4021564 0.4021564
## 1976 0.4128045 0.4128045
## 1977 0.4612368 0.4612368
## 1978 0.6002915 0.6002915
## 1979 0.6677771 0.6677771
## 1980 0.6601461 0.6601461
As we’ve seen previously, we can also plot the original time series against the forecasts:
plot(co2_hw, main="CO2 Emissions (ton) per capita, Indonesia")
legend("topleft", legend = c("Observed", "Forecast"), fill=1:2)
[Figure: CO2 Emissions (ton) per capita, Indonesia - observed series and Holt-Winters forecasts, with legend "Observed" / "Forecast"]
The forecast is represented by the red line, and we see that it is smoother than the time series of the original
data. As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the
in-sample forecast errors, which is stored in the named element SSE:
co2_hw$SSE
## [1] 0.6627841
So we mentioned earlier that by default HoltWinters() makes forecasts for the time period covered by
the original data, but to make forecasts for further time points we can use the forecast() function in the
forecast package, or using the generic built-in predict() function.
co2_fut <- predict(co2_hw, n.ahead=4)
co2_fut
## Time Series:
## Start = 2013
## End = 2016
## Frequency = 1
## fit
## [1,] 2.408917
## [2,] 2.408917
## [3,] 2.408917
## [4,] 2.408917
We can then use plot(), which accepts an additional predicted.values parameter to produce a chart of
the original time series along with the predicted values (n.ahead=4):
plot(co2_hw, co2_fut)
[Figure: Holt-Winters filtering - observed / fitted values of co2_hw with the 4 predicted values appended]
Alternatively, using the forecast package (the point forecast we’ll get will be exactly the same as the generic
predict() method above):
library(forecast)
# forecast 4 further points (2013-2016, 4 more years)
co2_hwf <- forecast(co2_hw, h=4)
co2_hwf
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2013 2.408917 2.259399 2.558434 2.180250 2.637584
## 2014 2.408917 2.197472 2.620361 2.085540 2.732293
## 2015 2.408917 2.149953 2.667881 2.012866 2.804968
## 2016 2.408917 2.109892 2.707941 1.951598 2.866235
Recall that earlier in the lesson, I mentioned that simple exponential smoothing without consideration
of any trend or seasonality is only really suitable for a time series that (no surprises) has no trend or
seasonality! Since the forecast is "told" that there is no trend and no seasonal component, it follows that
the prediction of future values is a "flat" forecast function.
While unnecessary, we can verify that the future forecasted value does in fact follow the same exponential
smoothing formula we’ve learned about in the last few sections:
alp <- co2_hwf$model$alpha
round(
alp * co2_hw$x[43] +
alp * (1-alp) * co2_hw$x[42]
,4)
## [1] 2.4089
That turns out to be 2.4089, so the point forecast computed using the forecast() model is the same as our
manual calculation using the exponential smoothing formula.
The forecast errors (residuals), from which the sum of squared errors is computed, are stored in the named element residuals:
round(co2_hwf$residuals,3)
## Time Series:
## Start = 1970
## End = 2012
## Frequency = 1
## [1] NA 0.019 0.027 0.037 0.007 0.011 0.048 0.139 0.067 -0.008
## [11] -0.017 0.021 0.019 -0.018 0.030 0.040 -0.012 -0.005 0.037 -0.020
## [21] 0.090 0.149 0.105 0.066 -0.004 0.000 0.125 0.107 -0.352 0.138
## [31] 0.085 0.130 0.035 0.026 0.073 -0.001 -0.007 0.110 0.152 0.101
## [41] -0.097 0.566 0.075
Let's manually verify that the sum of squared errors is in fact the value returned from $SSE:
sum(as.numeric(co2_hwf$residuals)^2, na.rm=T)
## [1] 0.6627841
co2_hw$SSE
## [1] 0.6627841
We've learned, perhaps throughout the Machine Learning Specialization, that our forecast is rarely perfect
and for the most part is estimated with some error. We saw above that these errors can be accessed with
co2_hwf$residuals, but if you want some confirmation of these values you could also apply the
exponential smoothing formula manually and compute the difference between those values and the corresponding
observed (actual) values:
round(co2_ts[2:8] - (co2_ts[1:7] * co2_hwf$model$alpha),3)
[Figure: ACF of co2_hwf$residuals, lags 0 to 20, with 95% confidence bounds]
acf() is a built-in function that computes and (by default) plots the autocorrelation (the default) or covariance of
our univariate time series. We specified 20 as the maximum lag at which to calculate the ACF. If this is the
first time you've come across the words "lag" and "autocorrelation", allow me to explain in greater detail what they
mean.
You've learned that we use correlation or covariance to compare two series of values and see how similar two
time series are in their variation. Autocorrelation aims to describe how similar the values of a time series
are to other values within the same series.
Lag can be understood as the delay, or the gap, between values. For lag 0, the autocorrelation compares
the entire time series with itself (and hence it is 1 in the ACF plot). For lag 1, the computation works by
shifting the time series by 1 and comparing the original series with the shifted-by-1 series, and it continues
doing so for the whole length of the time series.
For a time series that is "white noise", comprising completely random values throughout its length,
we expect to see no correlation anywhere apart from at lag 0. The ACF is a useful technique
that reveals at what "lag" in the time series we can expect correlation (see the simulated sketch below).
Imagine the stock prices of a public company over many years, sampled daily. You will find that the
time series exhibits a relatively high correlation at lag=1 (because the value on one day to a great extent
correlates with the price the next day), but this decreases quickly at lag=2. Suppose the company files its annual
returns every 12 months and we use a monthly time unit; then we should expect some correlation at lag=12
as well.
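Here is a small simulated sketch (not part of the original coursebook) contrasting the two behaviours: white noise shows no significant autocorrelation beyond lag 0, while a monthly series with a repeating 12-month pattern shows a clear spike around lag 12:
set.seed(100)
wn <- rnorm(120)                                                       # pure white noise
seasonal <- rep(sin(2 * pi * (1:12) / 12), 10) + rnorm(120, sd = 0.3)  # repeating yearly pattern plus noise
par(mfrow = c(1, 2))
acf(wn, lag.max = 24)
acf(seasonal, lag.max = 24)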
Going back to the ACF plot above: by default it plots the 95% confidence interval as dotted blue lines.
Observe that all sample autocorrelations except the one at lag 13 fall well inside the 95% confidence bounds,
indicating that the residuals appear to be random. By pure random chance (because of the 95% CI), you'd expect
to see roughly one lag in every 20 exceed the 95% threshold.
##
## Autocorrelations of series 'co2_hwf$residuals', by lag
##
## 0 1 2 3 4 5
## 1.000 -0.152 -0.075 0.051 0.117 -0.028
Observe from the correlogram that ACF(0) = 1, because all data are perfectly correlated with themselves.
Our ACF(1) = -0.152, indicating that the correlation between a point and the next point is -0.152. ACF(3) is
0.051, indicating that the correlation between a point and the point three lags away (three time steps apart) is
0.051, and so on.
So we've learned earlier that the ACF tells you how correlated the points are with each other in a given time
series, based on how many time steps they are separated by. Typically, we would expect the autocorrelation
function to fall towards 0 as points become more separated, because as a rule of thumb it is harder to forecast
further into the future from a given point in time.
The ACF and its close relative, the partial autocorrelation function (PACF), are used in the Box-
Jenkins/ARIMA modeling approach to determine how past and future data points are related in a time series.
The PACF can be thought of as the correlation between two points with the effect of the intervening
correlations removed. This is important because in reality, even if each data point is only directly correlated with
the next data point and none other, that correlation may "dribble down", leading us to believe that there is a
correlation between the current point and a point further down the future. For example, T1 is directly correlated
with T2, which is directly correlated with T3. This can make T1 look as if it is directly correlated with T3 -
the PACF removes the intervening correlation with T2, letting us better
discern the true pattern of correlations.
In general, the “partial” correlation between two variables is the amount of correlation between them that
is not explained by their mutual correlations with a specified set of other variables. For example, if we are
regressing variable “Y” on variables X1, X2 and X3, the partial correlation of X3 and Y can be computed as
the square root of the reduction in variance that is achieved by adding X3 to the regression of Y on X1 and
X2.
pacf(co2_hwf$residuals, lag.max=20, na.action=na.pass)
[Figure: PACF of co2_hwf$residuals, lags 1 to 20]
Ljung-Box Test
Now going back to the earlier correlogram we plotted: notice that the autocorrelation at lag 3 is just touching
the significance bounds. To test whether there is significant evidence for non-zero correlations at lags 1-20, we
can perform a Ljung-Box test using the Box.test() function. The maximum lag that we want to look at is
specified using the lag parameter. We will use the Ljung-Box test to determine whether our residuals are
random:
Box.test(co2_hwf$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: co2_hwf$residuals
## X-squared = 17.942, df = 20, p-value = 0.5912
Here the Ljung-Box test statistic is 17.94, and the p-value is 0.59, so there is little evidence of non-zero
autocorrelations in the in-sample forecast errors at lags 1-20. The Ljung-Box test statistic (X-squared) gets
larger as the sample autocorrelations of the residuals get larger - without going too far beyond the scope
of this coursebook, here's the general idea:
For a p-value < 0.05: you can reject the null hypothesis (no autocorrelation) knowing that you have less
than a 5% chance of doing so in error. This also means we can assume that our values show dependence
on each other - so when looking for dependence we want to see a large test statistic and a small p-value.
For a p-value > 0.05: we don't have enough evidence to reject the null hypothesis (no correlation), so we assume
our values are random (no structure in the data), i.e. there is no dependence on each other.
Lastly, to be sure that the predictive model cannot be further improved upon, let's check that the errors are
approximately normally distributed with mean zero and constant variance:
hist(co2_hwf$residuals, breaks=20, xlim=c(-1,1))
abline(v=mean(co2_hwf$residuals, na.rm=T), col="goldenrod3", lwd=2)
[Figure: Histogram of co2_hwf$residuals, with a vertical line at the mean]
[Figure: time plot of co2_hwf$residuals, roughly -0.2 to 0.6]
Observe that the in-sample forecast errors seem to have roughly constant variance over time, although the
fluctuations around 1998 and 2010 may be slightly larger than elsewhere in the series.
Before we conclude, recall again from our Ljung-Box test that there is little evidence of non-zero correlations
in the in-sample forecast errors, and that the distribution of our errors also seems approximately normal with
mean ~0: both of which suggest that simple exponential smoothing does provide an adequate predictive
model for the Indonesia CO2 emission forecast. We'd reason that the assumptions that the 95% prediction
intervals were based upon are valid, given:
- no autocorrelations in the forecast errors
- forecast errors (residuals) are normally distributed
- st = α·xt + (1 − α)(st−1 + bt−1)
- bt = β(st − st−1) + (1 − β)bt−1
Recall that α is our overall smoothing factor and β is our trend smoothing factor.
To forecast m steps beyond xt:
Ft+m = st + m·bt
Some suggestions to initialize the trend b are:
- b1 = x1 − x0
- b1 = [(x1 − x0) + (x2 − x1) + (x3 − x2)] / 3
- b1 = (xn − x1) / (n − 1)
Notice that in the first smoothing equation, we adjust st directly for the trend of the previous period (hence
adding st−1 + bt−1 ). This helps to eliminate the lag and brings st to the appropriate base of the current value.
The second smoothing equation then updates the trend, which is expressed as the difference between the last
two values (st − st−1 ). The equation is similar to the basic form of single smoothing, but here applied to the
updating of the trend.
If you followed along with the earlier exercise, you've re-written the following code (removing beta=FALSE
from the function call):
co2_holt <- HoltWinters(co2_ts, gamma=F)
co2_holt
Notice that the Holt's exponential smoothing model estimates the parameter for the trend component
(beta) in our time series in addition to alpha, which we're already familiar with.
The estimated value of alpha is 0.75 and of beta is 0.11. Alpha is high, indicating that the estimates of
the level are based more strongly on very recent observations in the time series; the trend
smoothing (slope) is estimated from observations that are recent as well as more distant, and that makes
pretty good intuitive sense if we visually inspect the time series again:
plot(co2_holt)
[Figure: Holt-Winters filtering - observed co2_ts and fitted values from co2_holt]
Observe that the in-sample forecasts agree pretty well with the observed values, although they tend to lag
behind the observed values a bit. If we want, we can specify the initial values of the level and slope b of the
trend component using the l.start and b.start arguments. It is common to set the initial value of the
level to the first value in the time series (recall s1 = x1 ) and the initial value of the slope to the second value
minus the first value (recall b1 = x1 − x0 ), so fitting a predictive model with initial values would look like this:
co2_holt2 <- HoltWinters(co2_ts, gamma=F, l.start=co2_ts[1], b.start=co2_ts[2]-co2_ts[1])
head(co2_holt2$x)
## Time Series:
## Start = 1970
## End = 1975
## Frequency = 1
## [1] 0.3119519 0.3306215 0.3580080 0.3954703 0.4021567 0.4128050
We can also make forecasts for future times not covered by the original time series using the
forecast.HoltWinters() function. Let’s make predictions for 2013 to 2020 (8 more data points),
and plot them:
co2_hwf2 <- forecast(co2_holt2, h=8)
plot(co2_hwf2)
[Figure: Forecasts from HoltWinters - co2_hwf2 point forecasts with 80% and 95% prediction intervals, roughly 0.5 to 3.5]
The plot gives us the forecast (blue line), an 80% prediction interval and a 95% prediction interval. Let's apply
what we've learned earlier to see whether the model could be improved upon, by checking whether the in-sample
forecast errors show any non-zero autocorrelations at lags 1-20 using a correlogram plot:
# remove the first two residuals values as they are NA (no residuals)
resids <- co2_hwf2$residuals[3:43]
acf(resids, lag.max=20)
[Figure: ACF of resids, lags 0 to 20]
Observe from the above correlogram that the in-sample forecast error at lag 13 exceeds the significance
bounds. Again, we would expect one in 20 of the autocorrelations to exceed the 95% significance bounds by
chance alone.
Box.test(resids, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: resids
## X-squared = 15.973, df = 20, p-value = 0.7183
Indeed, with the Ljung-Box test we obtain a p-value of 0.72, indicating that there is little evidence of
non-zero autocorrelations in the in-sample forecast errors at lags 1-20.
We could further check that the forecast errors have constant variance over time and are normally distributed
with mean zero. We can use a time series plot and a histogram to inspect the distribution of the errors:
plot(resids, cex=0.5, pch=19)
abline(h=0, col="red", lwd=3)
[Figure: plot of resids against index, roughly -0.4 to 0.4, with a red horizontal line at 0]
Observe that the errors do seem to have roughly constant variance over time and appear to be centered
around mean zero.
To summarize, our Ljung-Box test shows that there is little evidence of autocorrelations in the forecast
errors, while the time plot of the forecast errors shows that it is plausible that they have
mean zero and constant variance. We can therefore conclude that our
double exponential smoothing (Holt's exponential smoothing) provides an adequate predictive model for
Indonesia's CO2 emissions.
[Figure: plot of souvenir_ts - monthly souvenir sales, roughly 0 to 100,000]
It looks like a multiplicative time series - we'll apply a log transformation and then apply Holt-Winters
to the resulting series:
souv_hw <- HoltWinters(log(souvenir_ts))
souv_hw
## s8 0.10147055
## s9 0.09649353
## s10 0.05197826
## s11 0.41793637
## s12 1.18088423
The estimated value of alpha (0.41) is relatively low, indicating that the estimate of the level at the current
time point is based on both recent observations and some observations in the more distant past.
The value of beta is 0.00, indicating that the estimate of the slope b of the trend component is not updated
over the time series and is instead set equal to its initial value. This makes good intuitive sense, as the level
changes quite a bit over time but the slope b of the trend component remains roughly the same:
plot(souv_hw)
[Figure: Holt-Winters filtering - observed and fitted values of souv_hw (log sales, roughly 8 to 11)]
In contrast, the value of gamma (0.96) is high, indicating that the estimate of the seasonal component at the
current time point is based just on very recent observations.
From the plot above we see that the Holt-Winters exponential smoothing method is very successful in predicting
the seasonal peaks, which occur roughly in November every year.
Understanding Holt-Winters
The Holt-Winters method, also known as triple exponential smoothing, was first suggested by Holt's student
Peter Winters, and introduces a third equation, in addition to the two in Holt's method, to account for
seasonality. Generally, to produce forecasts (assuming additive seasonality), the formula would be:
Forecast = Most recent estimated level + trend + seasonality
And for multiplicative seasonality, we would instead use:
Forecast = Most recent estimated (level + trend) * seasonality
The equations for multiplicative seasonality are:
- Overall smoothing: st = α(xt / It−L) + (1 − α)(st−1 + bt−1)
- Trend smoothing: bt = β(st − st−1) + (1 − β)bt−1
- Seasonal smoothing: It = γ(xt / st) + (1 − γ)It−L
- Forecast: Ft+m = (st + m·bt)·It−L+m
Where alpha, beta and gamma are estimated in a way that minimizes the MSE of the error, and:
- x is the observation
- s is the smoothed observation
- b is the trend factor
- I is the seasonal index (sequence)
- F is the forecast at m periods ahead
- t is an index denoting a time period
- L is the number of periods in a season
Notice the It−L in our formula: this makes sense because It is the sequence of seasonal correction factors, so
a t value of 6 in a 4-period season (Q1 to Q4) would refer to the seasonal correction factor of the second
period (Q2).
A complete season’s data consists of L periods, and we need to estimate the trend factor from one period to
the next. To accomplish this, it is advisable to have at least two complete seasons.
The general formula to estimate the initial trend b:
b = (1/L) · [ (xL+1 − x1)/L + (xL+2 − x2)/L + ... + (xL+L − xL)/L ]
Now onto finding the initial values for the seasonal indices. To make things less abstract, assume we’re
working with data that consist of 5 years and 4 periods (4 Quarters per year):
• Step 1: Compute the averages of each of the 5 years (A1 ...A5 )
• Step 2: Divide each observation by the average of its year, giving the following table of ratios:
Period | Year 1 | Year 2 | Year 3  | Year 4  | Year 5
1      | y1/A1  | y5/A2  | y9/A3   | y13/A4  | y17/A5
2      | y2/A1  | y6/A2  | y10/A3  | y14/A4  | y18/A5
3      | y3/A1  | y7/A2  | y11/A3  | y15/A4  | y19/A5
4      | y4/A1  | y8/A2  | y12/A3  | y16/A4  | y20/A5
• Step 3: The seasonal indices are then formed by computing the average of each row. Thus the seasonal
indices are computed as:
I1 = (y1/A1 + y5/A2 + y9/A3 + y13/A4 + y17/A5) / 5
(and I2, I3, I4 are computed analogously from the remaining rows, as shown in the sketch below)
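A small toy computation (made-up quarterly numbers, not coursebook data) makes steps 1-3 concrete:
y <- c(12, 20, 28, 16,  13, 22, 30, 17,  15, 24, 33, 19,  16, 26, 35, 20,  18, 28, 38, 22)
A <- tapply(y, rep(1:5, each = 4), mean)       # step 1: yearly averages A1..A5
ratio <- y / rep(A, each = 4)                  # step 2: divide each observation by its year's average
I <- tapply(ratio, rep(1:4, times = 5), mean)  # step 3: average each quarter across the 5 years
round(I, 3)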
We obtain the forecast (blue line) with an 80% (dark grey) and a 95% (light grey) prediction interval, respectively.
We'll also check that the predictive model cannot be improved upon, by checking whether the in-sample forecast
errors show non-zero autocorrelations at lags 1-20 and performing the Ljung-Box test:
resids2 <- souv_hwf$residuals[!is.na(souv_hwf$residuals)]
acf(resids2, lag.max=20)
[Figure: ACF of resids2, lags 0 to 20]
##
## Box-Ljung test
##
## data: resids2
## X-squared = 17.53, df = 20, p-value = 0.6183
The correlogram shows that the autocorrelations for the in-sample forecast errors do not exceed the significance
bounds for lags 1-20. Furthermore, the p-value from our Ljung-Box test is 0.62, indicating that there is little
evidence of non-zero autocorrelations at lags 1-20. We'll also use a time plot and a histogram to check that the forecast
errors follow a normal distribution with mean zero and roughly constant variance:
plot(resids2, pch=19, cex=0.5, ylim=c(-1, 1))
abline(h=0, col="dodgerblue4")
[Figure: plot of resids2 against index, roughly -1 to 1, with a horizontal line at 0]
[Figure: Histogram of resids2]
It appears plausible from both plots that the forecast errors do follow these assumptions - combined with
the Ljung-Box test and ACF plot, we may say that Holt-Winters exponential smoothing appears to provide
an adequate predictive model for the log of sales at the souvenir shop, one that can likely not be improved upon.
ARIMA Models
Stationarity and Differencing
An important concept when dealing with time series is the idea of stationarity. A stationary time series is one
whose properties do not depend on the time at which the series is observed, so a time series with trend or
seasonality is not stationary - the trend and seasonality will affect the value of the time series at different
times. A white noise series, on the other hand, is stationary.
Some cases can be less clear-cut: a time series with cyclic behavior (but no trend or seasonality) is stationary.
This is because the cycles are not of fixed length, so before we observe the series we cannot be sure where the
peaks and troughs of the cycles will be. In general, a stationary time series will have no predictable patterns
in the long term, and its plot will show roughly constant variance.
Quiz: Which of the three time series are stationary?
library(fpp)
#op <- par(mfrow=c(2,2), pty="s", mar=c(0,0,0,0))
#par(mfrow=c(4,1))
plot(a10.ts)
title("Anti-diabetic drug sales, Aus")
[Figure: a10.ts - monthly anti-diabetic drug sales, Australia, roughly 5 to 15]
plot(wmurders.ts)
title("Female murder per 100000, US")
[Figure: wmurders.ts - female murders per 100,000, US, roughly 2.5 to 4.5]
plot(debit.ts)
title("Debit card usage (mil), Iceland")
[Figure: debit.ts - debit card usage (millions), Iceland, roughly 10,000 to 25,000]
Obvious seasonality rules out the debit card usage and anti-diabetic drug sales series, while trend rules
out the female murder rate series. In addition to seasonality, the increasing variance in the anti-diabetic drug sales
series also violates the assumptions required for stationarity.
Let’s take the anti-diabetic drug sales series and see if we can make it stationary: a common technique that
springs to mind is to deal with non-constant variance by taking the logarithm or square root of the series, so
we’ll try that.
Apart from taking the log, another common technique to deal with non-stationary data is to difference the
data. That is, given the series Zt, create the new series:
Yt = Zt − Zt−1
The differenced data will contain one less point than the original data. Although we can difference the data
more than once, one difference is usually sufficient.
See diff() in action:
prices <- c(10, 4, 17, 3, 1)
diff(prices)
## [1] -6 13 -14 -2
Let’s plot the original diabetic drug sales series, the log-transformed series, and the differenced series:
par(mfrow=c(3,1))
plot(a10.ts)
plot(log(a10.ts))
plot(diff(log(a10.ts), lag=12), xlab="Year")
[Figure: three panels - a10.ts, log(a10.ts), and diff(log(a10.ts), lag = 12) plotted against Year]
Transformations such as logarithms help to stabilize the variance of a time series and techniques such as
differencing (compute differences between consecutive observations) help stabilize the mean of a time series
by removing changes in the level of a time series and so eliminating trend and seasonality.
Another technique that is helpful is the ACF plot we’ve learned about earlier: The ACF of a stationary
time series will drop to zero relatively quickly while the ACF of a non-stationary data decreases
slowly:
par(mfrow=c(1,2))
acf(a10.ts)
acf(diff(log(a10.ts), lag=12))
[Figure: side-by-side ACF plots of a10.ts and diff(log(a10.ts), lag = 12)]
Second order differencing:
prices <- c(10, 4, 17, 3, 1)
prices
## [1] 10 4 17 3 1
diff(prices)
## [1] -6 13 -14 -2
diff(diff(prices))
## [1] 19 -27 12
Seasonal difference is the difference between an observation and the corresponding observation from the
previous season, and are also called “lag-m differences” as we subtract the observation after a lag of m periods.
elec_ts <- ts(usmelec, frequency=12, start=c(1973,1))
elec_log <- log(elec_ts)
elec_seasondif <- diff(log(elec_ts), lag=12)
elec_doub <- diff(elec_seasondif, lag=1)
[Figure: the original, log, seasonally differenced and doubly differenced usmelec series]
Sometimes it is necessary to do both a seasonal difference and a first difference to obtain stationary data
as shown in the panel above. When both seasonal and first differences are applied it makes no difference
which is done first as the result will be the same. However, if the data have a strong seasonal pattern, it is
recommended that seasonal differencing be done first because sometimes the resulting series will be stationary
and there will be no need for a further first difference.
Augmented Dickey-Fuller (ADF) test:
Another, more objective, way to determine whether differencing is required is to use a unit root test; one of the
most popular is the Augmented Dickey-Fuller (ADF) test.
The null hypothesis for an ADF test is that the data are non-stationary, so large p-values are indicative of
non-stationarity and small p-values suggest stationarity. Using the usual 95% threshold, differencing is
required if the p-value is greater than 0.05.
adf.test(elec_ts, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: souvenir_ts
## Dickey-Fuller = -2.0809, Lag order = 4, p-value = 0.5427
## alternative hypothesis: stationary
Discuss: Between the two time series, which one is not stationary? How would you transform the time series
into one that is stationary? Use the Augmented Dickey-Fuller Test to confirm that your transformed time
series is in fact now stationary:
### Your answer below:
Autoregressive models
Recall from our regression models class that a multiple regression model “forecast” the variable of interest
using a linear combination of predictors. In an autoregressive model, we forecast the variable of interest using
a linear combination of past values of the variable. The term autoregression indicates that it is a regression of
the variable against itself, just as the term autocorrelation refers to the practice of performing correlation
on a time series against itself. Personally, I find it easier to think of the term “auto” in both cases as in
“autonomous” rather than “automatic”.
Given that definition, an autoregressive model of order p can be described as:
yt = c + φ1·yt−1 + φ2·yt−2 + ... + φp·yt−p + et
Where c is a constant and et is white noise. This is like a multiple regression but with lagged values of yt as
predictors. We refer to this as an AR(p) model. Note that changing the parameters φ1, ..., φp will result in
different time series patterns, and also that the variance of the error term et will only change the scale of the
series, not the patterns.
For an AR(1) model:
- When φ1 = 0, yt is equivalent to white noise (c + et)
- When φ1 = 1 and c = 0, yt is equivalent to a random walk
We normally restrict autoregressive models to stationary data, and then some constraints on the values of the
parameters are required (see the simulated sketch below for how φ1 shapes an AR(1) series):
- For an AR(1) model: −1 ≤ φ1 < 1
- For an AR(2) model: −1 ≤ φ2 < 1, φ1 + φ2 < 1, φ1 − φ2 < 1
To see an example of AR in action, we'll use the time series of monthly total electricity generation. The
length of our time series is 454 observations, and ar() will select the optimal order for us if we don't specify a
maximum via the optional order.max parameter:
elec_ar <- ar(elec_ts)
elec_ar
##
## Call:
## ar(x = elec_ts)
##
## Coefficients:
## 1 2 3 4 5 6 7 8
## 0.9307 -0.0796 0.0169 -0.1177 0.1437 0.0353 0.0497 -0.0769
## 9 10 11 12 13 14 15 16
## -0.0105 0.0651 -0.0349 0.4881 -0.4270 0.0349 -0.1109 0.1631
## 17 18 19 20 21 22 23 24
## -0.0370 -0.1711 0.0755 0.0337 -0.0407 -0.0034 0.0604 0.2555
## 25
## -0.2533
##
## Order selected 25 sigma^2 estimated as 143.1
When p >= 3, the restrictions are much more complicated - fortunately R takes care of these restrictions
when estimating a model.
Selecting appropriate values for p, d and q can be difficult, but fortunately the auto.arima() function
will do it for us.
Indonesia's nitrous oxide emissions form a time series with no seasonality and no trend;
nitrous oxide emissions come from agricultural biomass burning, industrial activities, and livestock
management. Take a look at the observations over the years:
co2$Nitrous.oxide.emissions..thousand.metric.tons.of.CO2.equivalent.
plot(co2$Nitrous.oxide.emissions..thousand.metric.tons.of.CO2.equivalent., type="l")
(Plot: annual nitrous oxide emissions, thousand metric tons of CO2 equivalent, plotted against the observation index.)
We’ll first create a time series object using observations from that data frame, specifying the start and
frequency:
agri_ts <- ts(co2$Nitrous.oxide.emissions..thousand.metric.tons.of.CO2.equivalent., start=1970, frequency=1)
We then use auto.arima() on the time series; since we observed earlier that there seems to be no seasonal
pattern, we'll restrict the search to non-seasonal models using seasonal=F:
agri_arima <- auto.arima(agri_ts, seasonal=F)
agri_arima
## Series: agri_ts
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.8244
## s.e. 0.0842
##
## sigma^2 estimated as 3210286376: log likelihood=-519.34
## AIC=1042.68 AICc=1042.99 BIC=1046.16
The output of auto.arima() suggests an ARIMA(0,1,1) model, which is equivalent to taking a first-order difference
and fitting an MA(1) model.
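As a quick sanity check (an illustrative sketch, not part of the original chunk), fitting that specification manually with Arima() should reproduce essentially the same coefficients:
# Manually fit ARIMA(0,1,1): a first difference plus an MA(1) term
Arima(agri_ts, order=c(0,1,1))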
Discussion: usconsumption is a multivariate time series measuring percentage changes in quarterly personal
consumption expenditure and personal disposable income for the US from 1970 to 2010:
head(usconsumption)
## consumption income
## 1970 Q1 0.6122769 0.496540
## 1970 Q2 0.4549298 1.736460
## 1970 Q3 0.8746730 1.344881
## 1970 Q4 -0.2725144 -0.328146
## 1971 Q1 1.8921870 1.965432
## 1971 Q2 0.9133782 1.490757
When we plot the multivariate time series, we’ll see the following pattern:
uscons_ts <- ts(usconsumption, start = c(1970,1), frequency = 4)
plot(uscons_ts)
(Plot: uscons_ts - quarterly percentage changes in consumption and income, drawn as two panels over time.)
1. Do you think the time series is stationary? (Hint: I mentioned earlier that it is the percentage change
in quarterly PCE and PDI, strongly suggesting that it is a differenced series)
2. Is the time series seasonal?
3. If the time series is stationary (Question 1), what do you think the order of differencing d (the "I" in ARIMA) would be?
Observe from the plot above (quarterly percentage changes in US consumption expenditure and income) that
there appears to be no seasonal pattern, so we will fit a non-seasonal ARIMA model:
uscons.arima <- auto.arima(uscons_ts[,1], seasonal=F)
uscons.arima
## Series: uscons_ts[, 1]
## ARIMA(0,0,3) with non-zero mean
##
## Coefficients:
## ma1 ma2 ma3 mean
## 0.2542 0.2260 0.2695 0.7562
## s.e. 0.0767 0.0779 0.0692 0.0844
##
## sigma^2 estimated as 0.3953: log likelihood=-154.73
## AIC=319.46 AICc=319.84 BIC=334.96
The result suggests an ARIMA(0,0,3) model; recall that this is equivalent to an MA(3) model:

$$y_t = 0.7562 + e_t + 0.2542 e_{t-1} + 0.2260 e_{t-2} + 0.2695 e_{t-3}$$

where 0.7562 is the constant term and $e_t$ is white noise with a standard deviation of about 0.63 (i.e. sqrt(0.3953)).
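We can confirm the residual standard deviation directly from the fitted object (a quick check, added for illustration):
sqrt(uscons.arima$sigma2)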
We can now plot our forecast, specifying h=10 to forecast 10 future values and include=100 to show the last
100 observations in the plot.
plot(forecast(uscons.arima, h=10), include=100)
To tell whether the chosen values of p and q are appropriate for the time series, we can apply the ACF and
PACF plots. Recall that an ACF plot shows the autocorrelations, which measure the relationship between
$y_t$ and $y_{t-k}$. Now, if $y_t$ and $y_{t-1}$ are correlated, then $y_{t-1}$ and $y_{t-2}$ must also be correlated. But then $y_t$ and
$y_{t-2}$ may be correlated simply because they are both connected to $y_{t-1}$, rather than because of any new
information that is useful to our forecast model.
To overcome this problem, we use partial autocorrelations to measure the relationship between $y_t$
and $y_{t-k}$ after removing the effects of the lags in between.
Pacf(uscons_ts[,1], main="")
(Plot: PACF of the US consumption series, lags 1-20.)
The partial autocorrelations have the same critical values of $\pm 1.96/\sqrt{T}$, where T is the number of points in
the time series. Here, the critical values are approximately 0.15 and -0.15 respectively.
1.96/sqrt(length(uscons_ts[,1]))
## [1] 0.1530503
-1.96/sqrt(length(uscons_ts[,1]))
## [1] -0.1530503
Observe the pattern in the first three spikes of the PACF plot - this is what we would expect from an
ARIMA(0,0,3), as the PACF tends to decay exponentially. In this case, the PACF leads us to the same model
as the automatic procedure we performed using auto.arima above (both suggested ARIMA(0,0,3)).
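In practice it helps to look at the ACF and PACF side by side; a quick sketch of one way to do that (tsdisplay() from the forecast package achieves the same in a single call):
par(mfrow=c(1,2))
Acf(uscons_ts[,1], main="ACF")
Pacf(uscons_ts[,1], main="PACF")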
How auto.arima works behind the scenes is a combination of KPSS tests for choosing the parameter d and a
search over p and q to minimize the AIC - the details of which are beyond the scope of this coursebook. However,
if you want to fit a model using your own parameters, you can do so using the Arima() function. To fit an
ARIMA(0,0,3) model to the US consumption data:
uscons_arima2 <- Arima(uscons_ts[,1], order=c(0,0,3))
uscons_arima2
## Series: uscons_ts[, 1]
## ARIMA(0,0,3) with non-zero mean
##
## Coefficients:
## ma1 ma2 ma3 mean
## 0.2542 0.2260 0.2695 0.7562
## s.e. 0.0767 0.0779 0.0692 0.0844
##
## sigma^2 estimated as 0.3953: log likelihood=-154.73
## AIC=319.46 AICc=319.84 BIC=334.96
This gives us exactly the same result as the auto.arima() function call we made above.
Let’s summarize the general approach to fitting an ARIMA model:
1. Plot the data; Identify any unusual observations
2. If necessary, transform the data (Box-Cox Transformation) to stabilize the variance
3. If data are non-stationary, take first differences of the data until data are stationary
4. Examine the ACF / PACF: Is an AR(p) or MA(q) model appropriate?
5. Try your chosen model(s)
6. Check the residuals from our models by plotting the ACF of the residuals - they have to look like white
noise
7. If residuals look like white noise, calculate forecasts
The automated algorithm only takes care of steps 3-5, so we’ll have to do the rest manually!
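As an illustrative sketch of steps 5 to 7 on the US consumption model fitted above (added here for convenience; the calls are standard forecast / stats usage):
fit <- Arima(uscons_ts[,1], order=c(0,0,3))                   # step 5: fit the chosen model
Acf(residuals(fit))                                           # step 6: residual ACF should look like white noise
Box.test(residuals(fit), lag=24, type="Ljung-Box", fitdf=3)   # formal check of residual autocorrelation
plot(forecast(fit, h=10))                                     # step 7: forecast once the residuals pass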
(Plot: quarterly retail trade index (euretail) over the years.)
This is a plot of the quarterly retail trade index in the Euro area, covering wholesale and retail trade and
repair of motor vehicles. The data are clearly non-stationary with some seasonality, so let's take a seasonal
difference and plot the seasonally differenced data:
tsdisplay(diff(euretail, 4)) # quarterly differenced
(Plot: tsdisplay of diff(euretail, 4) - time plot, ACF and PACF of the seasonally differenced series.)
The seasonally differenced data still appear to be non-stationary, so we also take a first difference; the result of tsdisplay(diff(diff(euretail, 4))) is shown below:
(Plot: tsdisplay of the seasonally and first differenced series - time plot, ACF and PACF.)
Our aim now is to find an appropriate ARIMA model based on the ACF and PACF shown. The significant
spike at lag 1 in the ACF and PACF suggests a non-seasonal MA(1) component, and the significant spike
at lag 4 (and, to a lesser extent, at lags 8, 12 and 16) in the PACF suggests a seasonal MA(1) component.
Consequently, we begin with an ARIMA(0,1,1)(0,1,1)[4] model, indicating a first and seasonal difference,
and non-seasonal and seasonal MA(1) components. The residuals for the fitted model are:
euretail_arima <- Arima(euretail, order=c(0,1,1), seasonal=c(0,1,1))
tsdisplay(euretail_arima$residuals)
(Plot: tsdisplay of euretail_arima$residuals - time plot, ACF and PACF of the residuals.)
Both the ACF and PACF show significant spikes at lag 2, and almost significant spikes at lag 3, indicating
that some additional non-seasonal terms need to be included in the model. Trying a few other models, we arrive
at the lowest AIC value with the ARIMA(0,1,3)(0,1,1)[4] model, and a Ljung-Box test on its residuals shows no evidence of remaining autocorrelation:
euretail_final <- Arima(euretail, order=c(0,1,3), seasonal=c(0,1,1))
Box.test(euretail_final$residuals, lag=16, type="Ljung-Box")
##
## Box-Ljung test
##
## data: euretail_final$residuals
## X-squared = 7.0105, df = 16, p-value = 0.9731
Now that we have a seasonal ARIMA model that passes the required checks, we’ll plot the forecasts from the
model for the next 3 years. Notice that the forecasts follow the recent trend in the data:
plot(forecast(euretail_final, h=12))
(Plot: Forecasts from ARIMA(0,1,3)(0,1,1)[4] for the next 3 years.)
For comparison, here is the model that auto.arima() selects with its default settings:
auto.arima(euretail)
## Series: euretail
## ARIMA(1,1,2)(0,1,1)[4]
##
## Coefficients:
## ar1 ma1 ma2 sma1
## 0.7345 -0.4655 0.2162 -0.8413
## s.e. 0.2239 0.1995 0.2096 0.1869
##
## sigma^2 estimated as 0.1592: log likelihood=-29.69
## AIC=69.37 AICc=70.51 BIC=79.76
This model has a larger AIC value than the one we fitted manually (69.37 compared to 67.4), as well as
a slightly higher residual sum of squares (8.754 vs 8.6) - this is because auto.arima takes some short-cuts
in order to speed up the computation and is not guaranteed to return the best model. We can turn those
short-cuts off when calling auto.arima():
auto.arima(euretail, stepwise=F, approximation=F)
## Series: euretail
## ARIMA(0,1,3)(0,1,1)[4]
##
## Coefficients:
## ma1 ma2 ma3 sma1
## 0.2625 0.3697 0.4194 -0.6615
## s.e. 0.1239 0.1260 0.1296 0.1555
##
## sigma^2 estimated as 0.1564: log likelihood=-28.7
## AIC=67.4 AICc=68.53 BIC=77.78
This time it returned the same model we had identified by hand:
euretail_final
## Series: euretail
## ARIMA(0,1,3)(0,1,1)[4]
##
## Coefficients:
## ma1 ma2 ma3 sma1
## 0.2625 0.3697 0.4194 -0.6615
## s.e. 0.1239 0.1260 0.1296 0.1555
##
## sigma^2 estimated as 0.1564: log likelihood=-28.7
## AIC=67.4 AICc=68.53 BIC=77.78
As an exercise, try changing the final line of code to apply.quarterly(), name the result webtfq, and print the
first 6 values of webtfq.
# Your answer below:
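Since the original code chunk for the website-visitor series is not reproduced on this page, here is a minimal self-contained sketch using simulated daily data (web_xts and its values are made up purely for illustration) showing how apply.monthly() aggregates an xts series - swapping in apply.quarterly() is the exercise:
library(xts)
dates <- seq(as.Date("2014-01-01"), as.Date("2016-12-31"), by="day")
web_xts <- xts(rpois(length(dates), lambda=200), order.by=dates)  # fake daily visitor counts
webtm <- apply.monthly(web_xts, sum)                              # monthly totals
head(webtm)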
(Plot: Monthly Website Visitors since Jan 2014, 2014-01-31 / 2016-12-25.)
(Plot: Forecasts from ARIMA(4,1,0).)
Another tip I want to give you is on using quantmod to extract stock data; R's quantmod package makes
this a breeze. When I execute the following code chunk, an object named AAPL will be created containing the
stock we specify through the ticker symbol:
options("getSymbols.warning4.0"=FALSE)
library(quantmod)
start <- as.Date("2014-09-10")
end <- as.Date("2018-09-09")
# Let's get Apple stock data; Apple's ticker symbol is AAPL. We use the
# quantmod function getSymbols, and pass a string as a first argument to
# identify the desired ticker symbol, pass 'yahoo' to src for Yahoo!
# Finance, and from and to specify date ranges
# The default behavior for getSymbols is to load data silently (and directly) into the
# global environment, with the object being named after the loaded ticker
# symbol.
Now that we've learned about moving averages, let's do a fun exercise of adding an SMA to our bar chart.
Recall that the bigger n is for a moving average, the smoother the line:
barChart(AAPL, theme='white.mono')
addSMA(n=15, col = "goldenrod4")
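To see the smoothing effect, we can overlay a longer-window SMA on the same chart (an illustrative addition; it assumes the barChart() call above is still the active quantmod chart):
addSMA(n=50, col="steelblue")  # a wider window produces a visibly smoother line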
One last tip: when you plotted a multivariate time series earlier, each series was drawn in its own panel
(the default plot.type="multiple"), but if you prefer to have all the series in a single plot, you can specify that:
## Multivariate
z <- ts(matrix(rnorm(300), 100, 3), start = c(1961, 1), frequency = 12)
class(z)
plot(z)  # default: plot.type = "multiple", one panel per series
(Plot: the three simulated series, Series 1 to 3, drawn in separate panels.)
plot(z, plot.type = "single", lty = 1:3, col=2:4)
(Plot: all three series overlaid in a single panel with different line types and colours.)
Graded Quizzes
This section of the workshop needs to be completed in the classroom in order to obtain a score that counts
towards your final grade.
Learn-by-Building
This week's learn-by-building may require a non-trivial amount of pre-processing, so I recommend you schedule
extra sessions with your mentor or work in a group if necessary. Download the dataset from the Chicago Crime
Portal, and use a sample of these data to build a forecasting project in which you inspect the seasonality and
trend of crime in Chicago. Submit your project in RMD format, and address the following
questions:
• Is crime generally rising in Chicago in the past decade (last 10 years)?
• Which time series method seems to capture the variation in your time series better? Explain your
choice of algorithm and its key assumptions
Students should be awarded the full (3) points if they address at least 2 of the above questions. The questions
are by no means definitive, but can be used as a "guide" in the preparation of your project. The data contain
a variety of offenses, but you can sample only the type of crime you're interested in (e.g. theft, narcotics, battery,
etc.). Use visualization if it helps support your narrative.
Alternatively, read on in the Extra Content and peek at the wiki.R script provided. Adjust the script
to scrape some data relating to Wikipedia views on any subject that you are interested in (choose between
the English or Bahasa Indonesia version of Wikipedia). Produce a well-formatted RMD report that outlines
your findings and forecasts. Use any forecasting or time series method and argue your case. Show at least 1
visualization to complement your writing.
Students should be awarded the full (3) points if the report displays:

- Appropriate use of any time series method
- Visualization
- Analysis of the trend (are more people interested in that subject over time? Are people reading about data
science algorithms the most on Mondays? Are Indonesians considerably more interested in the history and
definition of Bhinneka Tunggal Ika as Independence Day approaches each year?)

Show originality and your own analysis / opinion on the matter, supported by a simple visualization of the trend.
Extra Content:
Using the wikipediatrend library, I've downloaded the daily Wikipedia page views for the English9 and Bahasa
Indonesia10 versions of Joko Widodo's article. The full R code is in wiki.R. For convenience, I've saved a
copy of that dataset into your data_input directory:
jokowi_wiki <- read.csv("data_input/wikiviews.csv")
str(jokowi_wiki)
Using Prophet
Prophet is a forecasting tool released and made open source by Facebook’s Core Data Science team. The team
introduced it as “a procedure for forecasting time series data based on an additive model where non-linear
trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series
that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and
shifts in the trend, and typically handles outliers well”.
Let's load the library into our environment and run the prophet forecaster on the dataframe. The package
requires a dataframe with a ds column containing valid dates and a y column containing the time series values.
9 Joko Widodo, Wikipedia (English)
10 Joko Widodo, Wikipedia (Bahasa Indonesia)
library(prophet)
## ds
## 1518 2019-08-26
## 1519 2019-08-27
## 1520 2019-08-28
## 1521 2019-08-29
## 1522 2019-08-30
## 1523 2019-08-31
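The output above is the tail of the future dataframe used for prediction. The full preparation code lives in wiki.R; as a rough sketch of the typical Prophet workflow (the column names date and views below, and the 120-day horizon, are assumptions - check the actual structure of wikiviews.csv):
# Assumed column names and horizon; adapt to the real wikiviews.csv structure
jokowi_en <- data.frame(ds=as.Date(jokowi_wiki$date), y=jokowi_wiki$views)
jokowi_model <- prophet(jokowi_en)                             # fit the additive model
jokowi_fd <- make_future_dataframe(jokowi_model, periods=120)  # extend the ds column into the future
tail(jokowi_fd)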
Let's now use predict, passing in our model and our dataframe; this is a process you should be very familiar
with by now. The resulting dataframe (we name it jokowi_fc) contains the predicted values (yhat) along with their
lower and upper uncertainty bounds (yhat_lower and yhat_upper):
jokowi_fc <- predict(jokowi_model, jokowi_fd)
str(jokowi_fc)
actual_pred <- cbind(jokowi_en, jokowi_fc[1:nrow(jokowi_en), c("yhat_lower", "yhat", "yhat_upper")])
head(actual_pred)
(Plot: observed Wikipedia views (y) together with the predicted values over time.)
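A plot like the one above can be reproduced with prophet's built-in plotting method (added here as a pointer, not part of the original chunk):
plot(jokowi_model, jokowi_fc)  # actual points, fitted line and uncertainty band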
As with most time series techniques we've learned so far, the package also supports decomposition, which eases
the task of breaking down the trend at various degrees of resolution:
prophet_plot_components(jokowi_model, jokowi_fc)
(Plot: Prophet components - trend, weekly seasonality and yearly seasonality.)
What did we learn from the decomposition above? On which day of the week are people more likely to read
about our President, and what time of year happens to be more popular? As homework: do you suspect the
same pattern for other politicians in Indonesia? Do you see similar reading behavior when you perform the
analysis on the Bahasa Indonesia version of this article?
Annotations