Master Thesis Jimmy
multiple locations
Jimmy Wilde
Master Thesis
Autumn 2021
Acknowledgements
I gratefully thank Professor Petros Dellaportas for his help all along with my
master thesis. Given the current sanitary situation, we have not been able to
work alongside physically. Nevertheless, he has always taken the time to help me
remotely and gave me even more motivation for a fascinating project.
Contents
1 Preliminaries
2 Dataset
3 Short-term prediction
3.1 Methodology
3.3 Discussion
4.1 Methodology
4.3 Discussion
Bibliography
Introduction
Models able to forecast different characteristics of the wind have been studied
extensively over the last decades. Indeed, since wind power is one of the most
attractive renewable energy sources (high efficiency and low pollution) [1], many
scientists have focused their research on this topic. Besides statistical methods,
multiple forecasting methods exist, such as numerical weather prediction (NWP),
which relies on meteorological or geographic data like temperature, pressure,
surface roughness, and obstacles [2].
In this study, we focus on forecasting the wind speed and direction tuple for
multiple locations using only statistical and deep learning methods. This means
that no additional information, meteorological or geographic, will be used to
perform the predictions. We call the tuple composed of the wind speed and its
direction the wind velocity vector. Most of the current literature concentrates
on forecasting the wind speed alone (for example, Palomares-Sala 2009 [3]) or the
wind power, but not necessarily the wind velocity vector. However, Erdem & Shi
2011 [4] focused their research on short-term forecasts of wind velocity vectors.
Since the wind speed and direction exhibit both temporal dependence and
randomness, it was appropriate for them to use ARMA and VAR time series models
to predict the wind velocity vector a small number of data points ahead. The
dataset we will use contains wind velocity information not for one location but
for multiple locations, hence we will work not only on single-location
forecasting but also on multiple-location forecasting. Our goal is to search for
models that could predict wind velocity vectors as well as classical ARMA and
VAR models, but that could also use the intrinsic spatial relation of the data
to possibly outperform these models. The intuition is that, for close locations
at least, there should exist a causal relationship between the value of the time
series in location A and the past and/or future values of the time series in
location B.
Unfortunately, ARMA and VAR-based models are not adequate candidates for
this task because of their quick convergence to the stationary distribution of the
time series process. Deep learning models are a valid alternative though. Some
research on different artificial neural networks to predict the mean monthly speed
using neighboring stations has already been done by Bilgili 2017 [5]. Similarly, we
intend to find deep learning models to forecast the wind velocity vector, but for
a forecast range between 12 and 24 hours.
The remainder of this thesis is organized as follows. In the next chapter, we
recall the main mathematical concepts used throughout the research. Then, the
dataset is presented from its raw form to the final form on which the
computations are made. In the third chapter, the discussion focuses on the
models suited to short-term forecasting and their performance. In the following
chapter, multi-step deep learning models are presented and their results are
discussed. In the last chapter, the main findings are summarized.
For reproducibility purposes, a GitHub project was created, containing all the
notebooks and scripts designed throughout this project. All the libraries used
were updated to the latest version available at the current time of the project.
Chapter 1
Preliminaries
In this chapter, we review the basic theory of time series, (V)ARMA models, and
artificial neural networks.
Definition 1.1. A time series is a set of observations $\{X_t\}_{t \in T}$ with
$X_t \in \mathbb{R}$, each recorded at time $t \in T$. In time series analysis the
index set $T \subseteq \mathbb{N}$ is a set of time points.
Definition 1.2. The time series $\{X_t\}_{t \in T}$ is said to be weakly stationary
if for all $n \in \mathbb{N}^*$, for any $t_1, \dots, t_n \in T$ and for all $\tau$
such that $t_1 + \tau, \dots, t_n + \tau \in T$, all the joint moments of order 1
and 2 of $X_{t_1}, \dots, X_{t_n}$ exist, are finite and are equal to the
corresponding joint moments of $X_{t_1 + \tau}, \dots, X_{t_n + \tau}$.
We can go beyond the first two moments and define strong stationarity.
Definition 1.3. The time series $\{X_t\}_{t \in T}$ is said to be strongly
stationary if for all $n \in \mathbb{N}^*$, for any $t_1, \dots, t_n \in T$ and for
all $\tau$ such that $t_1 + \tau, \dots, t_n + \tau \in T$, the joint distribution
of $X_{t_1}, \dots, X_{t_n}$ is the same as the joint distribution of
$X_{t_1 + \tau}, \dots, X_{t_n + \tau}$.
In the rest of the thesis, when the term stationarity is used, it will refer to
either weak or strong stationarity, by abuse of language. Different methods can
be used to assess the stationarity of a time series. For this research, we will
use the Augmented Dickey-Fuller (ADF) test [6]. Its null hypothesis H0 is that
the time series is non-stationary.
Similarly, to check the causation between multiple time series we will refer to
Granger’s causality test [7]. This test’s null hypothesis is that the coefficients of
past values in the regression equation are zero, or in other terms that the past
values of a time series {Xt }t∈T are not causing values of the time series {Yt }t∈T .
Suppose that we need to test causality for multiple time series. The more infer-
ences are made, the more likely erroneous inferences become. Several statistical
techniques have been developed to address that problem, like the Bonferroni cor-
rection which is the most classical one.
Definition 1.4. Let $H_1, \dots, H_m$ be a family of hypotheses and
$p_1, \dots, p_m$ their corresponding p-values, where $m$ is the total number of
null hypotheses. The familywise error rate (FWER) is the probability of rejecting
at least one true $H_i$. The Bonferroni correction rejects the null hypothesis
for each $p_i \le \alpha / m$, thereby controlling the FWER at $\le \alpha$.
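As a minimal sketch of the Bonferroni correction described above, the rejection rule can be written in a few lines of Python. The function name `bonferroni_reject` and the p-values are hypothetical and chosen only for illustration; they do not come from the thesis experiments.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if H_i is rejected."""
    m = len(p_values)
    # Per-test threshold alpha / m keeps the familywise error rate at or below alpha.
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Illustrative p-values (made up for this example).
example_p_values = [0.001, 0.012, 0.04, 0.2]
decisions = bonferroni_reject(example_p_values, alpha=0.05)
print(decisions)
```

With four hypotheses and alpha = 0.05, the per-test threshold is 0.0125, so only the first two hypotheses are rejected even though the third p-value is below 0.05.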
for $t \in \mathbb{N}$. $\{\beta_i\}_{i=1}^{p}$ are the coefficient parameters of
the auto-regressive (AR) model, $\{\varphi_j\}_{j=1}^{q}$ are the coefficient
parameters of the moving average (MA) model, $\{\epsilon_k\}_{k=1}^{t}$ are the
error terms and $\alpha \in \mathbb{R}$ is the constant parameter.
To determine the appropriate lags p and q of the ARMA models, one should
analyze the plot of the two following functions.
Definition 1.6. For a stationary time series $X_t$ we define the Auto Correlation
Function (ACF) by
\[ \rho_\tau = \frac{E[(X_t - E[X_t])(X_{t+\tau} - E[X_t])]}{\mathrm{Var}(X_t)} \]
Similarly, the Partial Auto Correlation Function (PACF) gives the partial
correlation of a stationary time series $X_t$ with its own lagged values, after
regressing out the values of the time series at all shorter lags.
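To make the ACF definition concrete, the following sketch computes a sample autocorrelation in plain Python. It assumes the usual biased estimator (covariances divided by the series length), which is one common choice and not necessarily the one used by the plotting routines of this research; the function name `sample_acf` and the toy series are hypothetical.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation rho_0, ..., rho_max_lag of a 1-D series."""
    n = len(series)
    mean = sum(series) / n
    # Sample variance (lag-0 autocovariance), used as the normalizing constant.
    var = sum((x - mean) ** 2 for x in series) / n
    acf = []
    for lag in range(max_lag + 1):
        # Autocovariance at this lag, using the same mean for both factors.
        cov = sum(
            (series[t] - mean) * (series[t + lag] - mean)
            for t in range(n - lag)
        ) / n
        acf.append(cov / var)
    return acf


# Toy series, made up for illustration.
toy_series = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0]
rho = sample_acf(toy_series, max_lag=3)
print(rho)
```

By construction the value at lag 0 equals 1, and every value lies in [-1, 1].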
Definition 1.7. The Vector ARMA (VARMA) process is a multivariate version of the
ARMA model. Let $Y_t \in \mathbb{R}^n$, $n \in \mathbb{N}^*$ be a vector of time
series. The VARMA model with parameters $(p, q) \in \mathbb{N}^2$ is expressed as
follows [8]:
\[ \Phi_p(B) Y_t = \delta + \Theta_q(B) \varepsilon_t \]
where $\delta \in \mathbb{R}^n$ is the constant parameter and $B$ is the backward
shift operator such that $B^k Y_t = Y_{t-k}$ for $k = 1, \dots, t$. The
auto-regressive term of order $p$ is
$\Phi_p(B) = I - \Phi^{(1)} B^1 - \dots - \Phi^{(p)} B^p$ with
$\Phi^{(i)} \in \mathbb{R}^{n \times n}$ for $1 \le i \le p$ and similarly the
moving average term of order $q$ is
$\Theta_q(B) = I - \Theta^{(1)} B^1 - \dots - \Theta^{(q)} B^q$ with
$\Theta^{(j)} \in \mathbb{R}^{n \times n}$ for $1 \le j \le q$.
VARMA models are not identified without additional conditions on the represen-
tation matrices, hence we will keep only the auto-regressive part (VAR) of the
model in our research. We made this choice because past values of wind direction
and speed are likely to influence the current values of the time series.
Usually, ARMA and VAR parameters are estimated with the maximum likelihood
method. This technique finds the parameter values which maximize the probability
of obtaining the observed data. Unfortunately, in the case of the VAR model, the
higher the dimension of the time series, the more likely the model is to overfit,
especially for a small sample size. This phenomenon is called the curse of
dimensionality [9]. To mitigate it, one can add a lasso-like penalty to the
objective function being minimized.
Definition 1.8. The VAR(p) model with the following objective function is defined
as the penalized VAR(p) model:
\[ \min_{\delta, \Phi} \sum_{t=1}^{T} \Big\| Y_t - \delta - \sum_{l=1}^{p} \Phi^{(l)} Y_{t-l} \Big\|_F^2 + \lambda \|\Phi\|_1 \]
where $\|\cdot\|_F$ is defined as the Frobenius norm,
$\Phi = [\Phi^{(1)}, \dots, \Phi^{(p)}]$, $\|\Phi\|_1$ is the sum of the absolute
value of each component of each matrix in $\Phi$, and $\lambda > 0$ is a penalty
parameter that will be estimated using cross-validation.
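The penalized objective can be evaluated directly for given parameters, which may help when checking an implementation. The sketch below is plain Python and makes one assumption not stated in the definition: the first p observations are treated as initial values, so the sum effectively starts at the first time step for which all p lags are available. The function name and the toy numbers are hypothetical.

```python
def penalized_var_objective(series, delta, phis, lam):
    """Value of the lasso-penalized VAR(p) objective.

    series: list of n-dimensional observations (lists of floats)
    delta:  constant vector of length n
    phis:   list of p coefficient matrices, each n x n (nested lists)
    lam:    penalty parameter lambda > 0
    """
    p = len(phis)
    n = len(delta)
    fit = 0.0
    # Assumption: the first p observations only serve as lagged inputs.
    for t in range(p, len(series)):
        prediction = list(delta)
        for lag, phi in enumerate(phis, start=1):
            past = series[t - lag]
            for i in range(n):
                prediction[i] += sum(phi[i][j] * past[j] for j in range(n))
        # Squared Euclidean norm of the one-step residual.
        fit += sum((series[t][i] - prediction[i]) ** 2 for i in range(n))
    # L1 penalty: sum of absolute values of all coefficients.
    penalty = lam * sum(abs(c) for phi in phis for row in phi for c in row)
    return fit + penalty


# Toy example (made up): a series that a VAR(1) with 0.5 on the diagonal fits exactly.
toy_series = [[1.0, 0.0], [0.5, 0.0], [0.25, 0.0]]
toy_phi = [[[0.5, 0.0], [0.0, 0.5]]]
value = penalized_var_objective(toy_series, [0.0, 0.0], toy_phi, lam=0.1)
print(value)
```

In this toy case the residuals vanish, so the objective reduces to the penalty term alone, which illustrates how the penalty favors sparse coefficient matrices when several parameter sets fit equally well.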
Many tailored criteria exist for model selection, which is in our case choosing the
lag p for a VAR model. We will be focusing on the two following.
Definition 1.9. The following definitions come from [10]. The AIC is defined as
\[ \mathrm{AIC} = -2 \ln L_{\max} + 2k \]
where $L_{\max}$ is the maximum likelihood achievable by the model and $k$ is the
number of parameters of the model (Akaike 1974). The BIC was introduced by
Schwarz (1978) and is defined as
\[ \mathrm{BIC} = -2 \ln L_{\max} + k \ln N \]
where $N$ is the number of data points used in the fit. These criteria are used
for model selection to balance the quality of fit to observational data against
the complexity of the model: number, dimension, and distribution of the
parameters. In practice, the procedure is to compute these criteria for multiple
numbers of parameters, and then keep the number that minimizes one of the
criteria.
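The selection procedure can be sketched as follows, using the AIC as defined above and the standard form of the BIC, -2 ln L_max + k ln N. The log-likelihood values and parameter counts below are made up for illustration, and the helper names are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion from the maximized log-likelihood."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n_obs):
    """Bayesian information criterion (standard form, assumed here)."""
    return -2.0 * log_likelihood + k * math.log(n_obs)


def select_order(candidates, n_obs, criterion="aic"):
    """candidates maps a lag p to (maximized log-likelihood, number of parameters)."""
    scores = {}
    for p, (loglik, k) in candidates.items():
        scores[p] = aic(loglik, k) if criterion == "aic" else bic(loglik, k, n_obs)
    # Keep the lag with the smallest criterion value.
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fits: larger p improves the likelihood but adds parameters.
candidates = {1: (-1050.0, 6), 2: (-1040.0, 10), 3: (-1038.0, 14)}
best_aic, _ = select_order(candidates, n_obs=500, criterion="aic")
best_bic, _ = select_order(candidates, n_obs=500, criterion="bic")
print(best_aic, best_bic)
```

In this toy setting the AIC selects p = 2 while the BIC selects p = 1, reflecting the heavier penalty the BIC places on additional parameters once ln N exceeds 2.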
1.3 Artificial neural networks
We suppose that the reader is familiar with the basic notions of artificial neural
networks (ANN). We recall the definitions of the number of epochs and the batch
size.
Definition 1.11. The batch size is a hyper-parameter that defines the number
of samples to work through before updating the internal model parameters. The
training dataset can be divided into one or multiple batches of identical size
(one batch may be smaller if the dataset cannot be divided evenly).
Now, we briefly introduce the two ANN architectures, with their particularities,
that will be used in our research.
Definition 1.13. Long short-term memory (LSTM) networks belong to the class
of recurrent neural networks (RNN). LSTM networks can process not only single
data points but also entire sequences of data thanks to their feedback connections.
A common LSTM layer is composed of an input gate, an output gate and a forget
gate. To find more information on LSTM network architecture and characteristics,
please refer to Yu 2019 [12].
Chapter 2
Dataset
2.1 Data source
We focus only on the mean direction and mean speed in this research. One could
also use the minimum, maximum, and standard deviation variables as inputs to
forecast the wind velocity vector. Further research should be done on this
subject.
Since the average direction is given in degrees, we first convert it to radians
using the formula $\theta_{rad} = \theta_{deg} \times \pi / 180$. Then, to help
simplify the component model we will build, we normalize the wind direction by
removing the prevailing wind direction, which is computed using the following
equation [4]:
\[ \theta = \begin{cases} \tan^{-1}(S/C) & S > 0,\ C > 0 \\ \tan^{-1}(S/C) + \pi & C < 0 \\ \tan^{-1}(S/C) + 2\pi & S < 0,\ C > 0 \end{cases} \]
where $C = \sum_{i=1}^{n} \cos(\theta_i)$ and $S = \sum_{i=1}^{n} \sin(\theta_i)$
are the sums of the longitudinal and latitudinal components of the direction,
respectively. We are interested in forecasting the wind velocity vector in
$\mathbb{R}^2$. For this, we decompose the vector into the lateral and
longitudinal components as follows:
\[ v_{t,x} = u_t \cos(\theta_t - \theta), \qquad v_{t,y} = u_t \sin(\theta_t - \theta) \]
where $u_t \in \mathbb{R}_+$ is the wind speed and $\theta_t \in [0, 2\pi]$ is the
wind direction at time $t \in T$. We call $v_{t,x}$ the cosine speed and
$v_{t,y}$ the sinus speed.
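The preprocessing above can be sketched in Python as follows. One substitution is made and should be read as an assumption: instead of the three-case formula, the sketch uses `math.atan2(S, C)` reduced modulo 2π, which selects the same quadrant-aware angle in those three cases. Function names and sample directions are hypothetical.

```python
import math


def prevailing_direction(directions_rad):
    """Prevailing wind direction in [0, 2*pi) from directions in radians."""
    s = sum(math.sin(a) for a in directions_rad)
    c = sum(math.cos(a) for a in directions_rad)
    # atan2 handles the quadrant; the modulo maps the result into [0, 2*pi).
    return math.atan2(s, c) % (2.0 * math.pi)


def decompose(speed, direction_rad, prevailing_rad):
    """Return (cosine speed, sinus speed) relative to the prevailing direction."""
    delta = direction_rad - prevailing_rad
    return speed * math.cos(delta), speed * math.sin(delta)


# Made-up directions in degrees, converted to radians first.
directions = [math.radians(d) for d in (80.0, 90.0, 100.0)]
prevailing = prevailing_direction(directions)
v_x, v_y = decompose(5.0, math.radians(90.0), prevailing)
print(prevailing, v_x, v_y)
```

For directions spread symmetrically around 90 degrees, the prevailing direction is π/2, and a wind blowing exactly along it has all of its speed in the cosine component.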
[Figure: heat map of the wind velocity distribution; x-axis: Wind X [m/s], y-axis: Wind Y [m/s]]
Figures 2.2, 2.4 and 2.3 display respectively the heat map distribution in R2 and
the density plots for the wind speed and wind direction of our final dataset.
[Figure: density plot of the wind speed; x-axis: Speed in m/s, y-axis: Density]
The wind direction seems to lie mainly along the north-south axis. The wind
speed, on the other hand, has a right-skewed density whose shape resembles that
of a Poisson distribution. In the final dataset, the prevailing wind direction is
θ = 1.476 which corresponds to 57.3 − 90 = −32.7 degrees if we take 0 degrees to
be the north axis. The mean wind speed is 5.456 m/s and the mean direction is
180.5 degrees.
Once forecasts $\hat{v}_{t,x}$, $\hat{v}_{t,y}$ are obtained for the lateral and
longitudinal components at time $t \in T$, the corresponding wind direction
forecast can be computed with the following equation:
\[ \hat{\theta}_t = \tan^{-1}\!\left( \frac{\hat{v}_{t,y}}{\hat{v}_{t,x}} \right) + \theta \]
Similarly, we can determine the forecast wind speed as follows:
\[ \hat{u}_t = \sqrt{\hat{v}_{t,x}^2 + \hat{v}_{t,y}^2} \]
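A minimal sketch of this back-transformation is given below. It assumes that the quadrant of the forecast vector should be preserved, so `math.atan2` is used in place of the arctangent of the ratio, and the result is wrapped into [0, 2π); both choices are assumptions of this example, as are the function name and the numeric inputs.

```python
import math


def recover_speed_direction(v_x_hat, v_y_hat, prevailing_rad):
    """Convert forecast components back to (speed, direction in radians)."""
    # Euclidean norm of the forecast vector.
    speed = math.hypot(v_x_hat, v_y_hat)
    # Angle relative to the prevailing direction, then shifted back and wrapped.
    direction = (math.atan2(v_y_hat, v_x_hat) + prevailing_rad) % (2.0 * math.pi)
    return speed, direction


# Made-up forecast components and prevailing direction.
speed_hat, direction_hat = recover_speed_direction(3.0, 4.0, prevailing_rad=0.5)
print(speed_hat, direction_hat)
```

Using `atan2` avoids a division by zero when the cosine component is zero and distinguishes opposite directions that share the same ratio.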
[Figure: density plot of the wind direction; x-axis: Direction in degrees (0° is north), y-axis: Density]
We will compare different metrics across different models to benchmark the
performance of their forecasts. Since the forecast for time $t \in T$ is returned
as a tuple $\hat{v}_t = (\hat{v}_{t,x}, \hat{v}_{t,y})^\top$, we are particularly
interested in the error between the speed vector
$v_t = (v_{t,x}, v_{t,y})^\top$ and its corresponding forecast $\hat{v}_t$.
Hence, we will focus on the Mean Distance Error (MDE) to assess the performance
quality, with the error being the Euclidean distance between the real vector and
the forecast vector. The MDE is computed as follows:
\[ \mathrm{MDE} = \frac{1}{N} \sum_{t=1}^{N} \| \hat{v}_t - v_t \|_2 \]
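Two other metrics will also be reported, the Mean Absolute Error (MAE) for the
wind speed and the wind direction. They are computed using these formulas:
\[ \mathrm{MAE}_{speed} = \frac{1}{N} \sum_{t=1}^{N} |\hat{u}_t - u_t| \]
\[ \mathrm{MAE}_{direction} = \begin{cases} \dfrac{1}{N} \sum_{t=1}^{N} |\hat{\theta}_t - \theta_t| & 0 < \theta_t < \pi \\[2ex] \dfrac{1}{N} \sum_{t=1}^{N} \left( 2\pi - |\hat{\theta}_t - \theta_t| \right) & \pi < \theta_t < 2\pi \end{cases} \]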
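The three metrics can be sketched in plain Python as follows. The MDE and the speed MAE follow the formulas directly. For the direction, the sketch uses a wrapped angular difference, min(d, 2π − d), which is a simplification assumed for this example and not the exact case distinction of the formula above. All function names and sample values are hypothetical.

```python
import math


def mean_distance_error(forecasts, actuals):
    """Mean Euclidean distance between forecast and true velocity vectors."""
    return sum(math.dist(f, a) for f, a in zip(forecasts, actuals)) / len(actuals)


def mae_speed(speed_forecasts, speed_actuals):
    """Mean absolute error of the wind speed."""
    pairs = zip(speed_forecasts, speed_actuals)
    return sum(abs(f - a) for f, a in pairs) / len(speed_actuals)


def mae_direction_wrapped(direction_forecasts, direction_actuals):
    """Mean absolute angular error in radians, taking the shorter way around."""
    total = 0.0
    for f, a in zip(direction_forecasts, direction_actuals):
        d = abs(f - a) % (2.0 * math.pi)
        # Assumption: errors larger than pi are measured the other way around.
        total += min(d, 2.0 * math.pi - d)
    return total / len(direction_actuals)


# Made-up forecasts and observations.
mde = mean_distance_error([(1.0, 0.0), (0.0, 2.0)], [(0.0, 0.0), (0.0, 0.0)])
mae_u = mae_speed([5.0, 7.0], [4.0, 9.0])
mae_theta = mae_direction_wrapped([0.1], [2.0 * math.pi - 0.1])
print(mde, mae_u, mae_theta)
```

The last call illustrates why wrapping matters: two directions on either side of north differ by 0.2 radians, not by almost 2π.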
Chapter 3
Short-term prediction
3.1 Methodology
The dataset is divided into training and test sets using the ratio 0.8/0.2, which
means there will be 2 × 12 211 training data points and 2 × 3 053 test data
points for each of the 48 locations. The training and test time series must
always consist of consecutive values, hence the split cannot be done randomly.
Only two train/test configurations are usable: in the first, the test set is the
first 20% of the data points; in the second, it is the last 20%. Since the
distribution of the data might vary between the beginning of the time series
(September) and its end (December), we will train distinct models for both
configurations, compute the resulting metrics and then report the averaged
metrics.
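The two chronological splits can be sketched as below. The helper name `chronological_splits` and the toy series are hypothetical; the only assumption is that the test size is the series length times the test ratio, rounded to the nearest integer.

```python
def chronological_splits(series, test_ratio=0.2):
    """Return the two contiguous (train, test) splits described in the text."""
    n_test = int(round(len(series) * test_ratio))
    if n_test < 1 or n_test >= len(series):
        raise ValueError("test_ratio leaves an empty training or test set")
    return {
        # Test set is the first 20% of the points, training set is the rest.
        "test_first": (series[n_test:], series[:n_test]),
        # Test set is the last 20% of the points, training set is the rest.
        "test_last": (series[:-n_test], series[-n_test:]),
    }


# Toy series standing in for one component at one location.
toy_series = list(range(10))
splits = chronological_splits(toy_series, test_ratio=0.2)
train_last, test_last = splits["test_last"]
train_first, test_first = splits["test_first"]
print(test_first, test_last)
```

With 15 264 points per component, this rounding gives 3 053 test points and 12 211 training points, matching the sizes quoted above.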
In this chapter, one-step out-of-sample forecasts will be made. With the current
dataset, this corresponds to a 10-minute-ahead forecast. We will then report the
average error metrics over the 508 hours of data in each test set.
For the ARMA models, computations are made using the statsmodels Python
module. For one location, one ARMA model per component (lateral and longitu-
dinal) will be used and we will build one specific model per location. Therefore
our model will be composed of 2 × 48 ARMA(p,q) models.
The first step in establishing the ARMA model is identification. In order to
identify and then develop the model, the stationarity assumption for the lateral
and longitudinal component time series should be checked. For this, we use the
ADF test whose null hypothesis is: the time series is non-stationary. Since the
p-value was smaller than $10^{-4}$ for both the longitudinal and lateral
components, we reject the null hypothesis for both tests. If the time series were
not stationary, it would have been more appropriate to use ARIMA(p,d,q) models,
which make the time series stationary by differencing it d times.
Once the stationarity has been verified, the parameters p and q have to be chosen.
To choose the parameters of the ARMA models, one can analyze PACF and ACF
plots.
[Figures 3.1 and 3.2: PACF and ACF plots for one location]
Figures 3.1 and 3.2 only represent one location, but the plots are very similar
for every other location. Their analysis suggests using p = 1 and q ≤ 150.
Indeed, p = 1 is the only lag well above the significance line in the PACF. In
the ACF plot, however, every q ≤ 150 is well above the significance line.
Hence, we will train ARMA models with p = 1 and 1 ≤ q ≤ 6 for the location-wise
approach. We decide to keep q ≤ 6 based on the balance between the time needed
to train each model (driven here by the number of parameters) and the
improvement in performance. Remember that q = 6 means that the moving average
coefficient parameters are estimated for the past hour (6 × 10 = 60 minutes)
which is, in our opinion, a convenient threshold.
For the VAR models, computations are also made using the statsmodels Python
module. For one location, the natural approach is to train one VAR model, with
the lateral and longitudinal components being the 2 dimensions of the
multivariate time series. For multiple locations there exist at least two
alternatives:
• Location-wise approach: train one VAR model per location. Therefore, our
model will be composed of 48 VAR(p) models of dimension 2.
• Single entity approach: assume that past values of some locations will
influence values in another location. Here, there will be only one VAR(p) model,
but with a dimension equal to twice the number of locations: 96.
To test causality even before building the VAR model, one can use Granger’s
causality test. For the location-wise approach, we need to check, for each
location, whether the cosine speed causes the sinus speed or the sinus speed
causes the cosine speed. If at least one of the two propositions holds, then
building a VAR model for each location is legitimate. For each location, the
minimum p-value obtained is smaller than $10^{-4}$ for both tests. Even if we
apply the Bonferroni correction, we can still safely reject the null hypotheses,
which are that each time series does not cause the other. Since each component is
directly related, by construction, to the wind speed u, this result is not
surprising. For the single entity approach, we compute Granger’s causality test
for all possible combinations of time series in our dataframe and store the
p-values of each combination in a 96 × 96 matrix. The mean number of p-values
smaller than $10^{-4}$ in each row is higher than 42. The mean number of p-values
smaller than $10^{-4}$ in each column is also higher than 42. This means that, on
average, a given time series in the single entity dataset is caused by 42 other
time series and also causes 42 time series. This result motivates the single
entity approach but, at the same time, a large number of time series do not
cause one another. To address this issue, the penalized VAR model seems more
appropriate (subsection 3.1.3). We will still train the single entity VAR models,
to have a benchmark against the penalized VAR models.
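Counting the significant entries of such a p-value matrix can be sketched as follows. The 3 × 3 matrix is made up, the function name is hypothetical, and the sketch assumes that the diagonal (a series tested against itself) is ignored.

```python
def significant_counts(p_matrix, threshold=1e-4):
    """Count p-values below the threshold per row and per column."""
    n = len(p_matrix)
    # Assumption: diagonal entries are not tests of interest and are skipped.
    significant = [
        [(i != j) and (p_matrix[i][j] < threshold) for j in range(n)]
        for i in range(n)
    ]
    row_counts = [sum(row) for row in significant]
    col_counts = [sum(significant[i][j] for i in range(n)) for j in range(n)]
    return row_counts, col_counts


# Made-up p-values for three series.
toy_p_values = [
    [1.0, 1e-6, 0.3],
    [1e-5, 1.0, 1e-7],
    [0.5, 0.2, 1.0],
]
row_counts, col_counts = significant_counts(toy_p_values)
row_mean = sum(row_counts) / len(row_counts)
col_mean = sum(col_counts) / len(col_counts)
print(row_counts, col_counts, row_mean, col_mean)
```

For a square matrix the row mean and the column mean are both the total number of significant entries divided by the number of series, so they always coincide; the individual row and column counts are what differ between series.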
Now that the causation has been verified, the parameter p has to be chosen.
Table 3.3 displays the AIC and BIC criteria computed for different lags p, with *
highlighting the minimum. In the location-wise case, we only display the
criteria for one location, since they are similar for all 48 locations. For the
location-wise approach, based on the AIC and the BIC, we decide to train VAR(p)
models for p = 1, . . . , 6. Similarly, for the single entity approach, we decide
to train VAR(p) models for p = 1, 2, 3. Even if the BIC suggests using a smaller
p, we are interested in the behavior of the results as p increases.
3.1.4 Deep learning models
To tune the parameters of the deep learning models, we will use a portion of the
training set as the validation set. Hence, here the dataset will be separated into
training sets, validation sets, and test sets with respective proportions 0.7, 0.1,
and 0.2. Therefore, the test sets are the same as for the ARMA and VAR-based
models and the different models will be compared with metrics computed on the
same sets.
The models will make a set of predictions based on a window of consecutive sam-
ples from the data. The width (number of time steps) of the input window is a
parameter of the models. We call lag the input width. We computed the metrics
with a single lag (10 minutes), and also for a lag of 6 (1 hour).
As for the VAR models, we will benchmark the performance of the two alternatives:
the location-wise approach and the single entity approach. The single entity
approach is not built in the same way as for the VAR models; more details can be
found in the next chapter. We will also consider another approach: the single
distribution. We could suppose that the distribution of each time series is
similar for all locations. In this case, we can train one single model for the
whole dataset with input and output dimension of 2. In theory, we could also
follow this approach for the ARMA and VAR models. Unfortunately, in practice,
implementing such models would require more time, since we cannot simply
concatenate the time series of each location. If we did so, there would be 95
jumps from the value of December 16th for location i to the value of September
2nd for location j. We think these jumps would strongly bias the models and the
results, hence we will not follow this method. For deep learning models, however,
this is no longer an issue; please refer to the next chapter for more
information. We will also discuss the architecture of the deep learning models
in more detail in the next chapter.
The table below reports the error metrics on the test sets, rounded to the third
decimal, obtained by each model for each approach followed. To evaluate the
performance of each model, we use a simple baseline that forecasts each provided
time series one step ahead by its last value: $\hat{X}_{t+1} = X_t$, where
$\hat{X}_{t+1}$ is the one-step forecast of the time series $\{X_n\}_{n=1}^{t}$.
Approach             Model              MDE     MAE speed   MAE direction
-                    Last baseline      0.807   0.514       0.245
Location-wise        ARMA(1,1)          0.803   0.518       0.249
Location-wise        ARMA(1,2)          0.801   0.517       0.249
Location-wise        ARMA(1,3)          0.801   0.517       0.249
Location-wise        ARMA(1,4)          0.800   0.517       0.249
Location-wise        ARMA(1,5)          0.800   0.517       0.249
Location-wise        ARMA(1,6)          0.800   0.517       0.249
Location-wise        VAR(1)             0.804   0.518       0.248
Location-wise        VAR(2)             1.054   0.681       0.317
Location-wise        VAR(3)             1.087   0.701       0.328
Location-wise        VAR(4)             1.101   0.709       0.333
Location-wise        VAR(5)             1.108   0.712       0.335
Location-wise        VAR(6)             1.110   0.714       0.336
Single entity        VAR(1)             0.820   0.530       0.264
Single entity        VAR(2)             1.538   0.978       0.429
Single entity        VAR(3)             1.541   0.980       0.429
Single entity        Penalized VAR(1)   0.927   0.647       0.285
Single entity        Penalized VAR(2)   0.944   0.663       0.290
Single entity        Penalized VAR(3)   0.958   0.674       0.293
Location-wise        LSTM with lag=1    0.832   0.545       0.252
Location-wise        LSTM with lag=6    0.842   0.556       0.260
Location-wise        CNN with lag=1     0.813   0.524       0.252
Location-wise        CNN with lag=6     0.811   0.525       0.253
Single distribution  LSTM with lag=1    0.812   0.524       0.249
Single distribution  LSTM with lag=6    0.804   0.521       0.249
Single distribution  CNN with lag=1     0.808   0.521       0.248
Single distribution  CNN with lag=6     0.806   0.521       0.248
Single entity        LSTM with lag=1    1.833   1.373       0.446
Single entity        LSTM with lag=6    2.016   1.479       0.493
Single entity        CNN with lag=1     1.199   0.795       0.361
Single entity        CNN with lag=6     1.272   0.851       0.379
Many observations can be made from the table containing the error metrics.
First, no model is able to beat the last baseline regarding the MAE of speed and
direction. Some models do perform better than the last baseline in terms of the
MDE though: all the ARMA models, the location-wise VAR(1), and the LSTM network
and CNN with a lag of 6 following the single distribution approach. Now, if we
focus on each model individually, we notice that the results of the ARMA models
are almost identical, no matter which order 1 ≤ q ≤ 6 was used. This could mean
that the last value of the time series gives enough information for the ARMA
model to perform well. Unfortunately, for the VAR and penalized VAR models the
opposite phenomenon is present: increasing the order of the model degrades the
forecasts. It is interesting to note that, for equal order p ≥ 2, the penalized
VAR models perform better than both VAR approaches. This might indicate that the
VAR model overfits when the order is increased. Regarding the different
approaches used for the VAR model, the location-wise strategy has better results
than the single entity approach. The single entity approach also seems to be the
least advisable for deep learning models, especially with LSTM networks. The
results from the single distribution approach are encouraging though: they are of
the same order as those of the ARMA models. In most cases, the CNN outperforms
the LSTM network, except for the single distribution approach with a lag of 6,
where the LSTM network is the best-performing deep learning model here.
Increasing the input width for the deep learning models gives mixed outcomes,
depending on both the approach and the artificial neural network used.
Figure 3.3 displays the one-step forecasts against the true values over 10 hours
in one location, for the ARMA(1,5) component-wise model and the last baseline,
for the wind speed and direction. In this figure, the predictions from the ARMA
model are close to the ones made by the last baseline.
3.3 Discussion
In Erdem 2011 [4], the mean MAE of speed and direction over locations is of a
higher order of magnitude than the results we obtained. It would have been
interesting to have the last baseline benchmark on their data to compare the
performance of the models. In any case, this difference highlights the high
variance of the wind speed and direction distributions depending on the location.
[Figure 3.3: "speed-average: Forecasts vs Actuals"; panels in m/s and deg; series: True value, Last baseline, ARMA(1,5)]
Our final dataset was composed of wind data for only 3 months. The distribution
of the wind speed and direction could differ between months or years. This idea
is supported by the fact that the results varied depending on the train-test
split used to train the models. For example, for the single entity CNN model, the
results differed by 0.03 depending on the training/test split, and by almost 0.1
for the VAR(6) model.
To enhance the results of the ARMA-based models, for the wind direction, we
could use a link function to do the conversion between a linear variable and a
circular variable [14].
[Figure: sinus speed (m/s); series: True values, Location-wise forecast, Single entity forecast]
The objective function minimized by the ARMA and VAR models is not the MDE,
but the mean squared error. One could modify the modules used in this research
to minimize our custom objective function instead.
Some locations in our dataset are separated by hundreds of kilometers, and it is
very unlikely that their time series cause one another. Hence, another approach
to this multiple-location framework could be to build a specific model for each
cluster of locations close to each other in terms of Euclidean distance. Clusters
could also be built by defining a notion of distance based on a causation test
statistic. Further research could be done on algorithms to build appropriate
clusters for multiple locations.
Classical ARMA and VAR models are typically well-suited for short-term forecasts,
but not for longer-term forecasts, because the auto-regressive part of the model
converges to the mean of the time series, as shown in Figure 3.4. On top of that,
the implemented version of these models is not designed to minimize an objective
function for a forecast horizon strictly greater than 1. Hence, if we increase
the forecast horizon for the VAR and ARMA models, the error metrics increase
quickly, as in Figure 3.5. Furthermore, when we increase the lag in the models,
the number of parameters increases proportionally to the number of variables used
as inputs, which greatly increases the computational time. Therefore, for the
multi-step forecasting part, we focus our research on deep learning models.
[Figure panels: VAR Location-wise, VAR Single entity; x-axis: Forecasting horizon; series: MDE, MAE speed, MAE direction]
Figure 3.5: Error metrics for long term forecasts of VAR models
Chapter 4
4.1 Methodology
Computations for the deep learning models are made using the Python library
TensorFlow. We follow the TensorFlow tutorial on time series forecasting,
adapting it to our dataset and needs. Our dataset must be converted to windows of
consecutive data before training the models (see Figures 4.1 and 4.5). Depending
on the approach, the window dataset has to be built differently.
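The conversion to windows can be illustrated in plain Python as follows. This is only a sketch under stated assumptions: the actual pipeline of this research relies on TensorFlow, whereas the helper `make_windows` and the toy series below are hypothetical and use no library at all.

```python
def make_windows(series, input_width, output_width):
    """Split a series into (inputs, labels) pairs of consecutive time steps."""
    total_width = input_width + output_width
    windows = []
    # Slide a window of fixed total width over the series, one step at a time.
    for start in range(len(series) - total_width + 1):
        inputs = series[start:start + input_width]
        labels = series[start + input_width:start + total_width]
        windows.append((inputs, labels))
    return windows


# Toy series: three input steps are used to predict the next two steps.
toy_series = [0, 1, 2, 3, 4, 5]
windows = make_windows(toy_series, input_width=3, output_width=2)
print(windows)
```

With 10-minute data, an input width and output width of 72 would correspond to 12 hours of history used to forecast the following 12 hours.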
There exists a large number of ANNs that could be relevant to forecast
multivariate time series similar to our dataset. We decide to focus on two
different architectures of neural networks: a convolutional neural network (CNN)
and a recurrent neural network with a Long Short-Term Memory (LSTM) layer. A
recurrent model can learn to use a long history of inputs if it is relevant to
the predictions the model is making. Here the model will accumulate internal
state for 12/24 hours, before making a single prediction for the next 12/24
hours.
This means that if the number of output time steps increases, then the number
of neurons in the hidden layer should decrease to avoid overfitting. We base our
choice of the number of neurons in the hidden layer on this formula and also on
the analysis of plots of the training and validation losses.
Usually, the most common loss function used to train deep learning models is the
mean squared error. Based on this, we use the mean squared distance error (MSDE)
as our loss function. We will also report the MDE and the MAE for wind speed and
direction as error metrics.
To assess the quality of performance of the models, we will also compute the
metrics for 2 simple baselines:
• Last baseline: repeat the last input time step for the required number of
output time steps.
• Repeat baseline: repeat the entire input time series as the output time series.
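Both baselines are simple enough to be written in a few lines. The sketch below uses plain Python lists of (cosine speed, sinus speed) tuples; the function names and sample values are hypothetical, and the repeat baseline assumes that the input and output windows have the same width.

```python
def last_baseline(inputs, output_steps):
    """Repeat the last input time step for every output time step."""
    return [inputs[-1]] * output_steps


def repeat_baseline(inputs):
    """Return the input window unchanged as the forecast window.

    Assumes the output window has the same width as the input window.
    """
    return list(inputs)


# Made-up input window of three time steps.
input_window = [(1.0, 0.5), (1.2, 0.4), (1.1, 0.6)]
last_forecast = last_baseline(input_window, output_steps=3)
repeat_forecast = repeat_baseline(input_window)
print(last_forecast)
print(repeat_forecast)
```

The last baseline captures persistence of the most recent state, while the repeat baseline would only perform well if the series had a period close to the window width.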
4.1.1 Location-wise approach
Here, we build a window dataset per location. For each one of them, we train
specific deep learning models with 2 features (cosine and sinus speed) as input
and 2 features as output.
We add a normalization layer to each artificial neural network with the mean
and standard deviation of each feature being calculated from the training set of
the specific location. The mean and standard deviation should only be computed
using the training data so that the models have no access to the values in the
validation and test sets.
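The principle of fitting the normalization on the training data only can be sketched as follows. The thesis implements it as a normalization layer inside each network; the plain-Python helpers below are hypothetical stand-ins, and the use of the population standard deviation is an assumption of this example.

```python
import statistics


def fit_normalizer(train_values):
    """Compute mean and standard deviation from the training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population standard deviation (assumed)
    if std == 0.0:
        raise ValueError("constant feature cannot be normalized")
    return mean, std


def normalize(values, mean, std):
    """Apply training-set statistics to any split (train, validation or test)."""
    return [(v - mean) / std for v in values]


# Made-up values of one feature, already split chronologically.
train_values = [1.0, 2.0, 3.0]
test_values = [4.0, 5.0]
mean, std = fit_normalizer(train_values)
train_norm = normalize(train_values, mean, std)
test_norm = normalize(test_values, mean, std)
print(train_norm, test_norm)
```

The test values are scaled with the training statistics, so no information from the validation or test sets leaks into the fitted model.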
For single-step prediction, we use 128 neurons in the hidden layer of both neural
networks, and 32/16 neurons for 12/24 hours of multi-step forecasting.
One could argue that the time series of all locations follow a similar
distribution. In this case, we can build a single window dataset from the time
series of all locations. To ensure that each batch contains only consecutive
values from the same location, we first perform the training/validation/test
split and the division into batches separately for each location, and then
concatenate the batches. This procedure also guarantees that the batches are the
same as in the location-wise approach for each training, validation, and test set.
Now we only have to train each deep learning model once, on a window dataset
with 2 features and more input samples than in the location-wise approach.
We use the same number of neurons in the hidden layer as in the location-wise
approach.
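The pooling step can be illustrated as follows, under the assumption that each location provides an array of shape (time, 2) long enough to contain at least one window. Windows are cut inside each location first and only then concatenated, so no window mixes values from two locations. The helper names are hypothetical, and the split into training, validation and test sets would be applied to each series before this step.

```python
import numpy as np

def make_windows(series, in_steps, out_steps):
    """Slice one location's series of shape (time, 2) into consecutive windows.

    Returns inputs (n, in_steps, 2) and labels (n, out_steps, 2).
    """
    series = np.asarray(series, dtype=float)
    total = in_steps + out_steps
    n = len(series) - total + 1  # assumed to be at least 1
    inputs = np.stack([series[i:i + in_steps] for i in range(n)])
    labels = np.stack([series[i + in_steps:i + total] for i in range(n)])
    return inputs, labels

def pooled_windows(series_by_location, in_steps, out_steps):
    """Build windows per location, then concatenate them into one dataset."""
    pairs = [make_windows(s, in_steps, out_steps) for s in series_by_location]
    xs, ys = zip(*pairs)
    return np.concatenate(xs), np.concatenate(ys)
```

Because the concatenation happens after windowing, the pooled dataset contains exactly the windows of the location-wise datasets, only more of them.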
For reasons related to the construction of the metrics, the single entity
approach differs slightly from the one followed in Chapter 3. Here, only the
input is treated as a single entity: the information from all locations is used
as input, but the output consists only of the 2 features of a specific location.
Hence, we need to train a different model for each location, each time using the
input of all locations. The window dataset is therefore built as in the
location-wise approach, except that the input now has 96 features.
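The shapes involved can be made concrete with a short sketch. It assumes the data is stored as an array of shape (time, locations, 2); with 2 features per location, 96 input features would correspond to 48 locations. The function name and the `target` argument are hypothetical.

```python
import numpy as np

def single_entity_windows(data, target, in_steps, out_steps):
    """Windows whose inputs use all locations and whose labels use one location.

    data: array of shape (time, n_locations, 2).
    Returns inputs (n, in_steps, n_locations * 2) and labels (n, out_steps, 2).
    """
    data = np.asarray(data, dtype=float)
    t, n_loc, n_feat = data.shape
    flat = data.reshape(t, n_loc * n_feat)  # all locations side by side
    total = in_steps + out_steps
    n = t - total + 1  # assumed to be at least 1
    inputs = np.stack([flat[i:i + in_steps] for i in range(n)])
    labels = np.stack(
        [data[i + in_steps:i + total, target, :] for i in range(n)]
    )
    return inputs, labels
```

Calling the function once per target location yields one dataset per model, all sharing the same inputs.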
As in the location-wise approach, we also add a normalization layer to each
artificial neural network, with the mean and standard deviation of each feature
computed on the training set.
Since the number of samples is smaller than in the single distribution approach,
and the risk of overfitting grows with the number of input features, the deep
learning models need fewer neurons in their hidden layer. For one-step-ahead
prediction we use 64 neurons, and for 72/144 steps ahead we use 16/8 neurons in
the hidden layer.
Table 4.1 displays the error metrics on the test sets, rounded to the third decimal,
obtained by each deep learning model for each approach. First of all, the number
of forecasting steps has been multiplied by 72 compared with the single-step
predictions, yet the error metrics increase only by a factor between 2 and 5,
depending on the model and the metric. Regarding the 12-hour predictions only,
the large performance gap between the last and repeat baselines suggests that
the last value of the time series gives more information on the values 12 hours
ahead than the past 12 hours as a whole. This gap is especially pronounced for
the direction error. The last baseline also has the best MAE for speed and
direction. However, 3 of the deep learning models were able to beat the last
baseline for the MSDE and MDE: the LSTM network and the CNN in the single
distribution approach, and the CNN in the location-wise approach. The metrics of
the location-wise and single distribution strategies are close for the CNN
model. This is not the case for the LSTM networks, which perform better in the
single distribution approach. Once again, the single entity approach has the
worst results, even if the speed MAE of its CNN model is encouraging. For
12-hour predictions, which correspond to 72 steps, the CNN compares favourably
with the LSTM network.
For 144-step forecasting (see Table 4.2), the proportional gap between the last
and repeat baselines is reduced, especially in terms of the wind speed MAE.
Even though the number of forecasting steps was doubled, most error metrics
increased by a relative amount well below 0.5. The results are consistent with
those obtained for 12 hours in terms of the ranking of approaches and artificial
neural networks. The last baseline has the best MAE for speed and direction.
However, as in the 12-hour setting, the same three deep learning models achieved
an MSDE and MDE below the last baseline benchmark. Among all deep learning
models, the single entity CNN had the best speed MAE. This is encouraging since
its MSDE and MDE are also close to those of the top 3 models.
The single distribution approach has shown better results for both the 12-hour
and 24-hour forecasts. This could be due to a lack of data per location for the
two other alternatives when using windows as large as ours. Given the performance
of the CNN in the location-wise approach, we would suggest this model as the
first choice when a larger dataset than ours is available.
Figures 4.1, 4.2, 4.3 and 4.4 show 3 different windows (single batches) of the
cosine speed feature of a window dataset constructed for 12-hour forecasting at a
given location. They allow us to compare, on these specific windows, the
location-wise CNN with the last and repeat baselines. For the 3 windows, the CNN
model seems to have learned to use the previous values to predict the pattern of
future values proficiently. Similarly, Figures 4.5, 4.6, 4.7 and 4.8 show 3
different windows of the cosine speed feature of a window dataset constructed for
24-hour forecasting at a given location. The same interpretation as for the
12-hour predictions applies to the location-wise CNN model here. It is also
noticeable that the repeat baseline forecasts are either very close to the true
values (second and third windows) or far from them (first window).
[Figures 4.1 to 4.4: three windows of the cosine-speed feature (in m/s) against
time (in hours) for 12-hour forecasting. Legend: Inputs, Labels, Predictions.]
[Figures 4.5 to 4.8: three windows of the cosine-speed feature (in m/s) against
time (in hours) for 24-hour forecasting. Legend: Inputs, Labels, Predictions.]
4.3 Discussion
With enough computational resources, the metrics could be computed for different
batch sizes and numbers of neurons in the hidden layer. This could confirm the
performance reported here, or even improve it for some models. Moreover, the
artificial neural networks themselves could be modified by adding a dropout
layer to prevent overfitting. The role of the dropout layer is to randomly set
input units to 0 at a given rate during training. This option could be
particularly useful for the single entity approach. Instead of using single-shot
models that predict the entire output sequence in a single step, it may also be
helpful to decompose the prediction into individual time steps. Each output of
the model can then be fed back into it at the next step, so that every
prediction is conditioned on the previous one. We call these multi-step deep
learning models auto-regressive deep learning models.
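The feedback loop of such an auto-regressive model can be sketched independently of any particular network. In the illustration below, `step_model` is a hypothetical stand-in for a trained one-step model: any callable that maps an input window of shape (in_steps, features) to the next time step of shape (features,).

```python
import numpy as np

def autoregressive_forecast(step_model, inputs, out_steps):
    """Roll a one-step model forward by feeding its outputs back as inputs.

    step_model: callable, (in_steps, features) -> (features,).
    inputs: initial window of shape (in_steps, features).
    Returns an array of shape (out_steps, features).
    """
    window = np.array(inputs, dtype=float)
    predictions = []
    for _ in range(out_steps):
        next_step = np.asarray(step_model(window), dtype=float)
        predictions.append(next_step)
        # Drop the oldest time step and append the new prediction.
        window = np.vstack([window[1:], next_step])
    return np.stack(predictions)

# Usage with a toy model that adds 1 to the last observed value.
toy_model = lambda w: w[-1] + 1.0
forecast = autoregressive_forecast(toy_model, [[0.0, 0.0], [1.0, 1.0]], 3)
```

A known drawback of this scheme is that errors made at early steps are fed back and can accumulate over the forecast horizon.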
There is a wide variety of artificial neural networks that are more or less
suited to forecasting multivariate time series. CNN and LSTM networks are common
architectures, and more complex ANNs could be better suited to the
multi-location setting. For instance, Barbounis 2007 [16] studied a locally
feedback dynamic fuzzy neural network trained on data from 3 stations. One might
adapt this network to the framework presented in this thesis. Zhu 2018 [17] also
proposed a model for wind speed prediction with spatio-temporal correlation: the
predictive deep convolutional neural network (PDCNN).
Concluding remarks
In this study, multiple models for forecasting wind velocity vectors were proposed
for single and multi-step predictions in multiple locations. They were evaluated
using at least three metrics: the MDE of the velocity vector, the MAE of speed,
and the MAE of direction. The forecasting results can be summarized as follows.
In neither framework was any model able to achieve performance metrics
significantly better than the last baseline. Nevertheless, in terms of MDE, we
suggest using an ARMA model per feature for short-term predictions. For
multi-step forecasting, the most appropriate deep learning model is a CNN
following either the location-wise or the single distribution approach,
depending on the amount of data available. Either the intrinsic spatial relation
of the dataset was not significant enough, or the models introduced were not
able to learn from it. Based on the results, we remain convinced that, with a
larger dataset and more time to investigate further models, the single entity
approach could improve on the location-wise approach. These findings could be
relevant for research on wind velocity vector forecasting and could stimulate
follow-up studies.
Bibliography
[1] W. Chang, “Short-term wind power forecasting using EPSO based hybrid
method,” Energies, vol. 6, pp. 4879–4896, 2013.
[2] W.-Y. Chang et al., “A literature review of wind forecasting methods,”
Journal of Power and Energy Engineering, vol. 2, no. 04, p. 161, 2014.
[4] E. Erdem and J. Shi, “ARMA based approaches for forecasting the tuple of
wind speed and direction,” Applied Energy, vol. 88, no. 4, pp. 1405–1414,
2011.
[7] C. Hiemstra and J. D. Jones, “Testing for linear and nonlinear Granger
causality in the stock price-volume relation,” The Journal of Finance, vol. 49,
no. 5, pp. 1639–1664, 1994.
[9] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37,
1966.
[12] Y. Yu, X. Si, C. Hu, and J. Zhang, “A review of recurrent neural networks:
LSTM cells and network architectures,” Neural Computation, vol. 31, no. 7, pp.
1235–1270, 2019.
[13] W. Nicholson, D. Matteson, and J. Bien, “BigVAR: Tools for modeling sparse
high-dimensional multivariate time series,” arXiv preprint arXiv:1702.07094,
2017.
[17] Q. Zhu, J. Chen, L. Zhu, X. Duan, and Y. Liu, “Wind speed prediction with
spatio-temporal correlation: A deep learning approach,” Energies, vol. 11,
no. 4, p. 705, 2018.