Master Thesis Jimmy

Forecasting wind velocity vector for
multiple locations
Jimmy Wilde
École Polytechnique Fédérale de Lausanne
Under the supervision of
Prof. Petros Dellaportas
University College of London - Athens University of Economics and Business -

The Alan Turing Institute
Master Thesis
Autumn 2021
Acknowledgements
I gratefully thank Professor Petros Dellaportas for his help all along with my
master thesis. Given the current sanitary situation, we have not been able to
work alongside physically. Nevertheless, he has always taken the time to help me
remotely and gave me even more motivation for a fascinating project.
1
Contents
1 Preliminaries 6
1.1 Time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 (V)ARMA models . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . 9
2 Dataset 10
2.1 Data source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Data filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Statistical description . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Short-term prediction 16
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 ARMA models . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 VAR models . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.3 Penalized VAR models . . . . . . . . . . . . . . . . . . . . 19
3.1.4 Deep learning models . . . . . . . . . . . . . . . . . . . . . 20
3.2 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . 20
2
3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Multi-step deep learning models 26
4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 Location-wise approach . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Single distribution approach . . . . . . . . . . . . . . . . . 28
4.1.3 Single entity approach . . . . . . . . . . . . . . . . . . . . 28
4.2 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Bibliography 39
3
Introduction
Well-suited models able to forecast different characteristics of the wind have been
deeply studied during the last decades. Indeed, since wind power is one of the
most attractive renewable energy (high efficiency and low pollution) [1], many
scientists have focused their research on this topic. There exist multiple fore-
casting methods, in addition to statistical methods, like using numerical weather
prediction (NWP) which are meteorological or geographic data like temperature,
pressure, surface roughness, and obstacles [2].
In this study, we are focusing on the forecasting of wind speed and direction tu-
ple for multiple locations using only statistical and deep learning methods. This
means that no additional information, meteorological or geographic, will be used
to perform the predictions. We call the tuple composed of information on the
wind speed and its direction: the wind velocity vector. Most of the literature
nowadays is concentrated on the forecast of the wind speed alone (like, for exam-
ple, Palomares-Sala 2009 [3]), or of the wind power but not necessarily of the wind
velocity vector. However, Erdem & Shi 2011 [4] focused their research on the fore-
casts of wind velocity vectors for short-term prediction. Since the wind speed and
direction have good succession and randomness, it was appropriate for them to
use ARMA and VAR models of time series to predict the wind velocity vector for a
small number of data points ahead. The dataset we will use is composed not only
of the wind velocity information in one location but in multiple locations, hence
we will work not uniquely with single location forecasting but also with multiple
locations forecasting. Our goal is to search different models that could predict
wind velocity vectors as well as classical ARMA and VAR models but also, that
could use the intrinsic spatial relation of the data to possibly outperform these
models. The intuition is based on the fact that, for close locations at least, there
should exist a causality between the value of the time series in location A and the
past and/or future values of the time series in location B.
Unfortunately, ARMA and VAR-based models are not adequate candidates for
this task because of their quick convergence to the stationary distribution of the
time series process. Deep learning models are a valid alternative though. Some
research on different artificial neural networks to predict the mean monthly speed
4
using neighboring stations has already been done by Bilgili 2017 [5]. Similarly, we
intend to find deep learning models to forecast the wind velocity vector, but for
a forecast range between 12 and 24 hours.
The remainder of this thesis is organized as follows. In the next section, we re-
call to the reader the main mathematical concepts used throughout the research.
Then, the dataset will be presented from its raw form to its final form on which
the computations will be made. In the third section, the discussion focuses on
the models adequate for short-term forecasting and their performances. In the
section afterward, multi-step deep learning models are presented, and discussion
is provided on their results. In the last section, the main findings are summarized.
For reproducibility purposes, a GitHub project was created, containing all the
notebooks and scripts designed throughout this project. All the libraries used
were updated to the latest version available at the current time of the project.
5
Chapter 1
Preliminaries
In this chapter, we review the basic theory of time series, (V)ARMA models, and
artificial neural networks.
1.1 Time series
Definition 1.1. A time series is a set of observations {Xt }t∈T with Xt ∈ R, each
recorded at time t ∈ T . In time series analysis the index set T ∈ N is a set of
time points.
Definition 1.2. The time series {Xt }t∈T is said to be weakly stationary if for all
n ∈ N∗ , for any t1 , . . . , tn ∈ T and for all τ such that t1 + τ, . . . , tn + τ ∈ T all the
joints moments of order 1 and 2 of Xt1 , . . . , Xtn exist, are finite and are equal to
the corresponding joints moments of Xt1 +τ , . . . , Xtn +τ .
We can go beyond the first two moments and define strong stationarity.
Definition 1.3. The time series {Xt }t∈T is said to be strongly stationary if for
all n ∈ N∗ , for any t1 , . . . , tn ∈ T and for all τ such that t1 + τ, . . . , tn + τ ∈ T
all the joint distribution of Xt1 , . . . , Xtn is the same as the joint distribution of
Xt1 +τ , . . . , Xtn +τ .
In the rest of the thesis, when the term stationarity is used, it will refer to either
weak or strong stationarity by language abuse. Different methods can be used
to assert the stationarity of the time series. For this research, we will use the
Augmented Dickey-Fuller (ADF) test [6]. Its null hypothesis H0 is that the time
series is non-stationary.
6
Similarly, to check the causation between multiple time series we will refer to
Granger’s causality test [7]. This test’s null hypothesis is that the coefficients of
past values in the regression equation are zero, or in other terms that the past
values of a time series {Xt }t∈T are not causing values of the time series {Yt }t∈T .
Suppose that we need to test causality for multiple time series. The more infer-
ences are made, the more likely erroneous inferences become. Several statistical
techniques have been developed to address that problem, like the Bonferroni cor-
rection which is the most classical one.
Definition 1.4. Let H1 , . . . , Hm be a family of hypotheses and p1 , . . . , pm their
corresponding p-values. Let m be the total number of null hypotheses. The
familywise error rate (FWER) is the probability of rejecting at least one true Hi .
The Bonferroni correction rejects the null hypothesis for each pi ≤ m α
, thereby
controlling the FWER at ≤ α.
1.2 (V)ARMA models
Definition 1.5. An Auto-Regressive Moving Average Process with parameters

(p,q)∈ R2 is denoted ARMA(p,q) and is defined as a time series Xt following the
equation:
p q
X X
Xt = ϵt + βi Xt−i + ϕj ϵt−j + α
i=1 j=1
for t ∈ N. {βi }pi=1 are the coefficient parameters of the auto-regressive (AR)
model, {ϕj }j=1 are the coefficient parameters of the moving average (MA) model,
q
{ϵk }tk=1 are the error terms and α ∈ R is the constant parameter.
To determine the appropriate lags p and q of the ARMA models, one should
analyze the plot of the two following functions.
Definition 1.6. For a stationary time series Xt we define the Auto Correlation
Function (ACF) by
E[(Xt − E[Xt ])(Xt+τ − E[Xt ])]
ρτ =
Var(Xt )
Similarly, the Partial Auto Correlation Function (PACF) gives the partial correla-
tion of a stationary time series Xt with its own lagged values, regressed the values
of the time series at all shorter lags.
Definition 1.7. The Vector ARMA (VARMA) process is a multivariate version
of the ARMA model. Let Yt ∈ Rn , n ∈ N∗ be a vector of time series. The VARMA
model with parameters (p,q)∈ R2 is expressed as follows [8]:
Φp (B)Yt = δ + Θq (B)εt
7
where δ ∈ Rn is the constant parameter and B is the backward shift operator
such that B k × Yt = Yt−k for k = 1, . . . , t. The auto-regressive term of order p is
Φp (B) = I − Φ(1) B 1 − . . . − Φ(p) B p with Φ(i) ∈ Rn×n for 1 ≤ i ≤ p and similarly
the moving average term of order q is Θq (B) = I − Θ(1) B 1 − . . . − Θ(q) B q with
Θ(j) ∈ Rn×n for 1 ≤ j ≤ q.
VARMA models are not identified without additional conditions on the represen-
tation matrices, hence we will keep only the auto-regressive part (VAR) of the
model in our research. We made this choice because past values of wind direction
and speed are likely to influence the current values of the time series.
Usually to estimate ARMA and VAR parameters one uses the maximum likelihood
estimation method. This technique finds values of the parameters which maximize
the probability of obtaining the data that are observed. Unfortunately, in the case
of the VAR model, the more the dimension of the time series will be high, the
more likely the model will be overfitting, especially for a small sample size. This
phenomenon is called the curse of dimensionality [9]. To avoid this phenomenon
one can add a penalty similar to lasso in the objective function to minimize.
Definition 1.8. The VAR(p) model with the following objective function is de-
fined as the penalized VAR(p) model:
T p
X X
min ∥Yt − δ − Φ(l) Yt−l ∥2F + λ∥Φ∥1
δ,Φ
t=1 l=1
where ∥.∥F is defined as the Frobenius norm, Φ = [Φ(1) , . . . , Φ(p) ], ∥Φ∥1 is the
sum of the absolute value of each component of each matrix in Φ, and λ > 0 is a
penalty parameter that will be estimated using cross-validation.
Many tailored criteria exist for model selection, which is in our case choosing the
lag p for a VAR model. We will be focusing on the two following.
Definition 1.9. The following definitions comes from [10]. The AIC is defined as
AIC = −2lnLmax + 2k
where Lmax is the maximum likelihood achievable by the model and k is the
number of parameters of the model (Akaike 1974). The BIC was introduced by
Schwarz (1978) and is defined as
BIC = −2lnLmax + k lnN
where N is the number of data points used in the fit. These criteria are used
for model selection to balance the quality of fit to observational data against the
complexity of the model: number, dimension, and distribution of the parameters.
In practice, the procedure is to compute these criteria for multiple numbers of
parameters, and then keep the number that minimizes one of the criteria.
8
1.3 Artificial neural networks
We suppose that the reader is familiar with the basic notions of artificial neural
networks (ANN). We will remind the definitions of the number of epochs and
batch size.
Definition 1.10. The number of epochs is a hyper-parameter that defines the

number of times that the learning algorithm will work through the entire training
dataset and update its internal parameters. The number of epochs is in general
large, allowing the learning algorithm to run until the loss function has been
sufficiently minimized.
Definition 1.11. The batch size is a hyper-parameter that defines the number
of samples to work through before updating the internal model parameters. The
training dataset can be divided into one or multiple batches of identical size (one
batch may be smaller if evenly division can not be achieved).
Now, we will briefly introduce the two architectures of ANN, with their peculiarity,
that will be used during our research.
Definition 1.12. Convolutional neural networks (CNN) are used in applications

like computer vision tasks but also for time series. This specific network contains
a convolutional layer (CONV) which creates a convolution kernel that is convolved
with the input layer over a single temporal dimension, in the case of time series, to
produce a tensor of outputs. The CONV layer will compute the output of neurons
that are connected to local regions in the input, each computing a dot product
between their weights and a small region they are connected to, in the input time
series. More specifics of CNN can be found in O’Shea, K., & Nash, R. (2015) [11].
Definition 1.13. Long short-term memory (LSTM) networks belong to the class
of recurrent neural networks (RNN). LSTM networks can process not only single
data points but also entire sequences of data thanks to their feedback connections.
A common LSTM layer is composed of an input gate, an output gate and a forget
gate. To find more information on LSTM network architecture and characteristics,
please refer to Yu 2019 [12].
9
Chapter 2
Dataset
Figure 2.1: Map with location of wind turbines in the dataset
Legend:
10
2.1 Data source
Different information on the wind was given to us by the International Wind

Engineering company, from 94 locations all across Greece (see Figure 2.1), for
every 10 minutes between January 1st, 1998 and March 1st, 2000. This company is
an international consulting and engineering company, offering accredited services
in the field of Wind Energy since 2002. The minimum, maximum, mean, and
standard deviation (std) inside the 10 minutes intervals were reported for both
speed and direction. Wind speed is in m/s and direction in degrees, with 0 being
the north axis.
2.2 Data filtering
Unfortunately, most of the data is missing or incoherent. Hence, we have to filter

the data to find a large period of time with as many locations as possible and as
few missing values as possible. Using a custom-made grid search approach, the
final choice of dataset is containing 48 different turbine locations, with time series
from September 2nd, 1999 to December 16th, 1999, for a total of 105 days. The
longest sequence of missing values for one location is 40 minutes (= 4 consecutive
data points) and there is only a total of 65 missing values out of 2 × 732 672 data
points. We use linear interpolation for these values, which should not bias the
final interpretation given their small proportion. We also use interpolation for
values above 50 m/s, since it corresponds to a high value during a wind storm.
For comparison, in 2019, the highest wind speed was recovered at 132 km/h (=
36.6 m/s) by the meteorological service of the Greek National Weather Observa-
tory. Therefore, those values are highly suspicious and incoherent with the rest of
the distribution of data. They might come from a defective measurement of the
turbines.
We will keep the focus only on the mean direction and mean speed during this
research. One could be interested in using also the minimum, maximum, and std
variables as inputs, to forecast the wind velocity vector. Further research should
be done on this subject.
2.3 Data pre-processing
Since the average direction is given in degrees, we first convert it to radians using
the formula θrad = θdeg × π/180. Then, to help simplify the component model
we will build, we normalize the wind direction by removing the prevailing wind
11
direction, which is computed using the following equation [4]:
 tan−1 (S/C)

S > 0, C > 0
θ= tan−1 (S/C) + π C < 0
tan−1 (S/C) + 2π S < 0, C > 0

where C = Σni=1 cos(θi ) and S = Σni=1 sin(θi ) which represent the sum of the lon-
gitudinal (resp. latitudinal) of the direction component. We are interested in
forecasting the wind velocity vector in R2 . For this we decompose the vector as
the lateral and longitudinal components as follow:
vt,x = ut × cos(θt − θ)
vt,y = ut × sin(θt − θ)
where ut ∈ R+ is the wind speed and θt ∈ [0, 2π] is the wind direction at time
t ∈ T . We call vt,x the cosine speed and vt,y the sinus speed.
2.4 Statistical description
500
20
400
10
300
Wind Y [m/s]
200
10
100
20
30 0
30 20 10 0 10 20 30
Wind X [m/s]
Figure 2.2: Heat map distribution of the dataset
Figures 2.2, 2.4 and 2.3 display respectively the heat map distribution in R2 and
the density plots for the wind speed and wind direction of our final dataset.
12
0.25
0.20
Density
0.15
0.10
0.05
0.00
0 5 10 15 20 25 30 35
Speed in m/s
Figure 2.3: Wind speed density plot
The wind direction seems to be mainly in the north-south axis. The wind speed
on the other hand seems to follow a Poisson distribution. In the final dataset, the
prevailing wind direction is θ = 1.476 which corresponds to 57.3 − 90 = −32.7
degrees if we consider 0 degree being the north axis. The mean wind speed is
5.456 m/s and the mean direction is 180.5 degrees.
2.5 Performance measures
Once forecasts v̂t,x , v̂t,y are obtained for the lateral and longitudinal components
at time t ∈ T , the corresponding wind direction forecast can be computed with
the following equation:
v̂t,y
θ̂t = tan −1
+θ
v̂t,x
Similarly, we can determine the forecast wind speed as follow:
q
2 2
ût = (v̂t,x + v̂t,y )
13
0.006
0.005
0.004
Density
0.003
0.002
0.001
0.000
0 50 100 150 200 250 300 350
Direction in degrees (0° is north)
Figure 2.4: Wind direction density plot
We will compare different metrics across different models to benchmark the per-
formance
of their forecasts. Since the forecast for time t ∈ T is returned in a tuple
v̂t,x
as = v̂t , we are particularly interested in the error between the vector of
v̂t,y
vt,x
speed vt = and its corresponding forecast v̂t . Hence, we will focus on the
vt,y
Mean Distance Error (MDE) to assess the performance quality, with the error
being the Euclidean distance between the real vector and the forecast vector. The
MDE is computed as follow:
N
1 X
MDE = × ∥v̂t − vt ∥2
N t=1
Similarly, we define the Mean Squared Distance Error (MSDE) as:

N
1 X
MSDE = × ∥v̂t − vt ∥22
N t=1
14
Two other metrics will also be reported, the Mean Absolute Error (MAE) for the
wind speed and the wind direction. They are computed using these formulas:
N
1 X
MAEspeed = × |ût − ut |
N t=1
( PN
|θ̂t −θt |
t=1
0 < θt < π
MAEdirection = PN N
t=1 2π−|θ̂t −θt |
N
π < θt < 2π
15
Chapter 3
Short-term prediction
3.1 Methodology
The dataset is divided into training and test sets using the ratio 0.8/0.2, which
means there will be 2 × 12 211 training data points, and 2 × 3 053 test data points
for every 48 locations. We need to always have consecutive values in the training
and testing time series hence the split can not be done randomly. Only two possi-
ble sets of training and test sets are usable: the first is with the test set being the
first 20% of the data points and the second is with the test set being the last 20%.
Since the distribution of the data might vary between the beginning of the time
series (September) and its end (December), we will train distinct models for both
frameworks, compute the resultant metrics and then return the averaged metrics.
In this chapter, one-step out-of-sample forecasts will be made. With the current
dataset, this corresponds to 10 minutes ahead forecast. We will then return the
average error metrics over the 508 hours of data in each test set.
3.1.1 ARMA models
For the ARMA models, computations are made using the statsmodels Python
module. For one location, one ARMA model per component (lateral and longitu-
dinal) will be used and we will build one specific model per location. Therefore
our model will be composed of 2 × 48 ARMA(p,q) models.
The first step to establishing the ARMA model is identification. In order to iden-
tify and then develop the model, the stationarity assumption of the lateral and
16
longitudinal components time series should be checked. For this, we use the ADF
test whose null hypothesis is: the time series is non-stationary. Since the p-value
was smaller than 10−4 for both the longitudinal and lateral components, we reject
the null hypothesis for both tests. If the time series were not stationary, it would
have been more appropriate to use ARIMA(p,d,q) models which makes the times
series stationary using its d-order derivative.
Once the stationarity has been verified, the parameters p and q have to be chosen.
To choose the parameters of the ARMA models, one can analyze PACF and ACF
plots.
Partial Autocorrelation Partial Autocorrelation

1.0 1.0
0.8 0.8
0.6 0.6
0.4 0.4
0.2
0.2
0.0
0.0
0 10 20 30 40 0 10 20 30 40
(a) Cosine speed (b) Sinus speed
Figure 3.1: PACF for one location
Autocorrelation Autocorrelation
1.0 1.0
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0.0 0.0
0.2 0.2
0 25 50 75 100 125 150 175 200 0 25 50 75 100 125 150 175 200
(a) Cosine speed (b) Sinus speed
Figure 3.2: ACF for one location
Figures 3.1 and 3.2 only represent one location, the plots are very similar for
every other location though. Their analysis suggests to use p = 1 and q ≤ 150.
Indeed, p = 1 is the only lag well above the significance line in the PACF. In the
case of the ACF plot though, every q ≤ 150 is well above the significance line.
17
Hence, we will train ARMA models with p = 1 and 1 ≤ q ≤ 6 for the location-
wise approach. We decide to keep q ≤ 6 based on the balance between the time
to train (number of parameters here) each model and the enhancement of the
performance. Remember that q = 6 means that the moving average coefficient
parameters are estimated for the past hour (6 × 10 = 60 minutes) which is, in our
opinion, a convenient threshold.
3.1.2 VAR models
For the VAR models, computations are also been made using the statsmodels
Python module. For one location, the natural approach is to train one VAR
model, with the lateral and longitudinal components being the 2 dimensions of
the multivariate time series. For multiple locations there exists at least two alter-
natives:
• Location-wise approach: train one VAR model per location. Therefore, our
model will be composed of 48 VAR(p) models of dimension 2.
• Single entity approach: assume that past values of some locations will influ-
ence values in another location. Here, there will be only one VAR(p) model
but with dimension being the double of the number of locations: 96.
To test causality even before building the VAR model, one can use Granger’s
Causality test. For the location-wise approach, we need to check, for each loca-
tion, if the cosine speed causes the sinus speed or if the sinus speed causes cosine
speed. If at least one of the two propositions is verified, then building a VAR
model for each location will be legitimate. For each location, the minimum p-
value obtained from the test is smaller than 10−4 for both tests. Even if we apply
the Bonferroni correction, we can still safely reject the null hypotheses which are
that each time series is not causing the other. Since each component is directly
related, by construction, to the wind speed u this result is not surprising. For the
single entity approach, we compute the Granger’s Causality test for all possible
combinations of time series in our dataframe and store the p-values of each com-
bination in a 96 × 96 matrix. The mean number of p-value smaller than 10−4 in
each row is higher than 42. The mean number of p-value smaller than 10−4 in each
column is also higher than 42. This means that, given a time series in the single
entity dataset, on average, it is caused by 42 other time series and causes 42 time
series also. This result motivates the single entity approach but at the same time,
there is a large amount of time series that don’t cause one another. To face this
issue, the penalized VAR model seems to be more appropriate (subsection 3.1.3).
We will still train the single entity VAR models, to have a benchmark against the
penalized VAR models.
18
Now that the causation has been verified, the choice of the parameter p has to be
made. Table 3.3 displays the computation of the AIC and BIC criteria for different
lag p with * highlighting the minimum. In the location-wise case, we only display
the criteria for one location, since they are similar for all 48 locations. For the
location-wise approach, based on the AIC and the BIC, we decide to train VAR(p)
models for p = 1, . . . , 6. Similarly, for the single entity approach, we decide to
train VAR(p) models for p = 1, 2, 3. Even if BIC suggests using smaller p, we are
interested in the behavior of the results while increasing p.
p AIC BIC p AIC BIC

0 3.971 3.972 0 148.5 148.5
1 -1.329 -1.326 1 -55.63 -50.97*
2 -1.331 -1.326 2 -55.88 -46.61
3 -1.340 -1.333 3 -56.05* -42.17
4 -1.345 -1.336 4 -55.87 -37.38
5 -1.347 -1.336* 5 -55.63 -32.53
6 -1.348* -1.335 6 -55.38 -27.66
7 -1.347 -1.332 7 -55.09 -22.76
Location-wise Single entity
Table 3.3: VAR Order Selection
3.1.3 Penalized VAR models
Nicholson 2017 [13] have implemented a package on R that includes regularization

methods for VAR models. Since penalized VAR models are more appropriate in
the case of high dimension, we will focus only on the single entity approach with
a single penalized VAR(p) model with input dimension 96. As it is most likely
that for a location, it might not cause values of locations located far away from it,
this approach seems well-founded. This intuition was confirmed in the previous
section by the results obtained for the Granger’s Causality test of the single entity
dataset. Indeed, with the penalization added, we hope that multiple entries of
the parameter matrix Φ would be estimated close to 0. We decide to train the
penalized VAR model for p = 1, 2, 3, like for the single entity VAR models.
To estimate the optimal penalty parameter λ, cross-validation is conducted in

a rolling manner to account for time dependence. By default, λ is computed
by minimizing one-step ahead mean squared forecast error, using a grid search
approach.
19
3.1.4 Deep learning models
To tune the parameters of the deep learning models, we will use a portion of the
training set as the validation set. Hence, here the dataset will be separated into
training sets, validation sets, and test sets with respective proportions 0.7, 0.1,
and 0.2. Therefore, the test sets are the same as for the ARMA and VAR-based
models and the different models will be compared with metrics computed on the
same sets.
The models will make a set of predictions based on a window of consecutive sam-
ples from the data. The width (number of time steps) of the input window is a
parameter of the models. We call lag the input width. We computed the metrics
with a single lag (10 minutes), and also for a lag of 6 (1 hour).
Like for the VAR models, we will benchmark the performance of the 2 alternatives:
location-wise approach and single entity approach. The single entity approach is
not built in the same way as the single entity approach of the VAR models. More
details on this topic can be found in the next chapter. We will also take interest in
another approach: the single distribution. We could suppose that the distribution
of each time series is similar for all locations. In this case, we can train one single
model for the whole dataset with input and output dimension of 2. In theory, we
could follow this approach also for the ARMA and VAR models. Unfortunately, in
practice, the implementation of such models would require more time since we can
not simply concatenate the time series of each location. If we do so, there will be
95 jumps from the value of December 16th for location i to the value of September
2nd for location j. We think these jumps would highly bias the models and the
results, hence we will not follow this method. However, for deep learning models,
this is not an issue anymore. Please refer to the next chapter for more information.
We will also discuss in more detail the architecture of the deep learning models
in the next chapter.
3.2 Performance analysis
The table on page 21 reports the error metrics on the test sets, rounded to the
third decimal, obtained by each model for each approach followed. To evaluate
the performance of each model, we will use a basic benchmark which is a simple
baseline that forecasts one step ahead of each provided time series by its last value:
X̂t+1 = Xt where X̂t+1 is the one-step forecast of time series {Xn }tn=1 .
20
Approach Model MDE MAE speed MAE direction
Last baseline 0.807 0.514 0.245
Location-wise ARMA(1,1) 0.803 0.518 0.249
Location-wise VAR(1) 0.804 0.518 0.248
Single entity VAR(1) 0.820 0.530 0.264
Single entity Penalized VAR(1) 0.927 0.647 0.285
Location-wise LSTM with lag=1 0.832 0.545 0.252
Location-wise LSTM with lag=6 0.842 0.556 0.260
Location-wise CNN with lag=1 0.813 0.524 0.252
Location-wise CNN with lag=6 0.811 0.525 0.253
Single distribution LSTM with lag=1 0.812 0.524 0.249
Single distribution LSTM with lag=6 0.804 0.521 0.249
Single distribution CNN with lag=1 0.808 0.521 0.248
Single distribution CNN with lag=6 0.806 0.521 0.248
Single entity LSTM with lag=1 1.833 1.373 0.446
Single entity LSTM with lag=6 2.016 1.479 0.493
Single entity CNN with lag=1 1.199 0.795 0.361
Single entity CNN with lag=6 1.272 0.851 0.379
Table 3.4: Short-term forecasts performance by model
21
Many observations can be made from the table containing the errors metrics.
First, no model is able to beat the last baseline regarding the MAE of speed and
direction. Some models do perform better than the last baseline in terms of the
MDE though: all the ARMA models, the location-wise VAR(1), and the LSTM
network and CNN with lag equals 6 following the single distribution approach.
Now, if we focus on each model individually, we can notice that the results of the
ARMA models are almost identical, no matter what order 1 ≤ q ≤ 6 was used.
This could mean that the last value of the time series gives enough information
for the ARMA model to perform well. Unfortunately, for the VAR and penal-
ized VAR model the opposite phenomenon is present: increasing the order of the
model reduces the performance of the forecasts. It is interesting to note that the
penalized VAR models are performing better than both approaches using VAR
models for equal order p ≥ 2. This might demonstrate the VAR model is overfit-
ting when the order is increased. Regarding the different approaches used for the
VAR model, the location-wise strategy has better results than the single entity
approach. The single entity approach seems also to be the least recommended ap-
proach to follow for deep learning models, especially using LSTM networks. The
results from the single distribution approach are encouraging though, they are of
the same order as the ARMA models. In most of the cases, the CNN outperforms
the LSTM networks except for the single distribution with lag equals 6 procedure
which is the deep learning model with the best performance here. Increasing the
input width for the deep learning models gives various outcomes, depending on
both the approach and the artificial neural network used.
Overall, it is complicated to give a relevant conclusion on which model is the most

appropriate to use for single-step predictions. However, what is flagrant is that
none of the models using the information of all other locations (single entity ap-
proach) is able to outdo the models using only information of the location which
values it is seeking to predict.
Figure 3.3 displays the one-step forecasts against the true values for 10h in one
location, for the ARMA(1,5) component-wise model, and the repeat baseline for
the wind speed and direction. In this figure, the predictions from the ARMA
model are close to the ones made by the last baseline.
3.3 Discussion
In Erdem 2011 [4], the order of the mean results of MAE of speed and direc-
tion over locations is higher than the results we obtained. It would have been
interesting to have the last baseline benchmark from their data to compare the
performance of the models. Anyway, this phenomenon highlights the high vari-
ance of the wind speed and direction distributions depending on the location.
22
speed-average: Forecasts vs Actuals
True value
11 Last baseline
ARMA(1,5)
10
m/s
00:00 02:00 04:00 06:00 08:00 10:00

26-Nov
direction-average: Forecasts vs Actuals
True value
24 Last baseline
ARMA(1,5)
22
20
18
deg
16
14
12
10
00:00 02:00 04:00 06:00 08:00 10:00

26-Nov
Forecasting horizon
Figure 3.3: Forecasts against actuals for short-term prediction
Our final dataset was composed of wind data for only 3 months. The distribution
of the wind speed and direction could differ for different months or years. Indeed,
this thought was consolidated by the results that were varying depending on the
train-test split used to train the models. For example, for the single entity CNN
model, the results had a 0.03 difference depending on the training/test split and
a difference of almost 0.1 for the VAR(6) model.
To enhance the results of the ARMA-based models, for the wind direction, we
could use a link function to do the conversion between a linear variable and a
circular variable [14].
23
True values
Location-wise forecast
Single entity forecast
2
1
Sinus speed (m/s)
00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00

Forecasting horizon
Figure 3.4: Long term forecasts of VAR models
The objective function minimized by the ARMA and VAR models is not the MDE,
but the mean squared error. One could change the different modules used in this
research to minimize our hand-made objective function.
Some locations in our dataset are separated by hundreds of kilometers and are
very unlikely that their time series causes one another. Hence, another approach
to this multiple locations framework could be to build a specific model for clus-
ters of locations close to each other, regarding the Euclidean distance. This could
also be done by defining a notion of distance based on a causation test statistic.
Further research could be done on algorithms to build appropriate clusters for
multiple locations.
Classical ARMA and VAR models are typically well-suited for short-term fore-
casts, but not for longer-term forecasts due to the convergence of the auto-
regressive part of the model to the mean of the time series, as it is shown in
Figure 3.4. On top of that, the implemented version of these models is not con-
ceived to minimize an objective function for a forecast horizon strictly superior
to 1. Hence, if we increase the forecast horizon for the VAR and ARMA models,
the error metrics will increase promptly, like in Figure 3.5. Furthermore, when
we increase the lag in the models, the number of parameters is increased pro-
portionally to the number of variables used as inputs which highly expands the
computational time. Therefore, for the multi-step forecasting part, we focus our
research on deep learning models.
24
VAR Location-wise VAR Single entity
0
00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00
Forecasting horizon Forecasting horizon
MDE MAE speed MAE direction
Figure 3.5: Error metrics for long term forecasts of VAR models
25
Chapter 4
Multi-step deep learning models
4.1 Methodology
Computations for the deep learning models are made using the Python library
TensorFlow . A TensorFlow tutorial to perform time series forecasting is available
here. We will follow this tutorial while adapting it to our dataset and needs. Our
dataset requires to be converted to windows of consecutive data before training
the models (see Figures 4.1 and 4.5). Depending on the approaches, the window
dataset has to be built distinctly.
As explained in subsection 3.1.4, a validation set is required to tune the param-

eters of the deep learning models. Hence, now the training/validation/test split
used is 0.7/0.1/0.2. Since we need to have consecutive values in each of these sets,
we have 6 possible sets of training, validation, and test sets. We will train the
models on only 4 of them to ensure that the test sets are the same as the ones in
Chapter 3. Those 4 possibilities are the 2 where the test set is the first 20% of
the data and the 2 where the test set is the last 20% of the data.
In this chapter, we are interested in multi-step prediction. In this framework, the

models need to learn to predict a range of future values. Thus, unlike a single-step
model, like in Chapter 3, where only a single future point is predicted, a multi-
step model predicts a sequence of the future values. The forecasts are computed
for 12 hours and 24 hours ahead, which corresponds to 72 and 144 data points
respectively. The models will use as input the same size of the range of data than
the range of the output. This means that the models will learn to predict 12/24
hours into the future, given 12/24 hours of past values. We will focus on models
using single-shot predictions, where the entire time series forecast is predicted at
once.
26
There exists a large number of ANN that could be relevant to forecast multi-
variate time series similar to our dataset. We decide to fixate on two different
architectures of neural networks: a convolution neural network (CNN) and a re-
current neural network with a layer called Long Short-Term Memory (LSTM). A
recurrent model can learn to use a long history of inputs if it’s relevant to the
predictions the model is making. Here the model will accumulate internal state
for 12/24 hours, before making a single prediction for the next 12/24 hours.
Both models have different hyper-parameters including the maximum number of

epochs, the batch size, and the number of neurons in the hidden layer. Here
the hidden layer is either the convolutional layer or the LSTM layer. We will
train each deep learning model, using a maximum number of epochs of 100. If
the validation loss is not enhanced during more than 10 epochs, then the fitting
procedure is stopped and the epoch with the best validation loss is kept. We use a
batch size of 128 for all models. From Hagan 1997 [15], a simplified rule of thumb
formula can be deducted to have a rough estimate of the number of neurons to
use in the hidden layer of an ANN with one hidden layer to avoid overfitting:
Ns
Nh =
α × (Ni + No )
This means that if the number of output time steps increases, then the number
of neurons in the hidden layer should decrease to avoid overfitting. We base our
choice of the number of neurons in the hidden layer on this formula and also on
plot analysis of train and valid losses.
Usually to train deep learning models, the most common loss function to use is
the mean squared error. Based on this, we use the MSDE as our loss function.
We will also report the MDE and the MAE for wind speed and direction as error
metrics.
To assess the quality of performance of the models, we will also compute the
metrics for 2 simple baselines:
• Last baseline: repeat the last input time step for the required number of
output time steps.
• Repeat baseline: repeat the entire input time series as the output time series.
In the single-step prediction, repeat and last baselines were equivalent.
27
4.1.1 Location-wise approach
Here, we build a window dataset per location. For each one of them, we train
specific deep learning models with 2 features (cosine and sinus speed) as input
and 2 features as output.
We add a normalization layer to each artificial neural network with the mean
and standard deviation of each feature being calculated from the training set of
the specific location. The mean and standard deviation should only be computed
using the training data so that the models have no access to the values in the
validation and test sets.
For single-step prediction, we use 128 neurons in the hidden layer of both neural
networks, and 32/16 neurons for 12/24 hours of multi-step forecasting.
4.1.2 Single distribution approach
One could argue that the distribution of each time series is similar for all loca-
tions. In this case, we need to build the window dataset using time series of all
locations. To ensure that each batch is containing only consecutive values of data
from the same location, we perform first the training/validation/test split and
division into batches separately on each location and then we concatenate the
batches. Following this procedure, we will also make sure that the batches are the
same as in the location-wise approach for each training, validation, and test set.
Now we only have to train each deep learning model once, on a window dataset
with 2 features with more input samples than for the location-wise approach.
Regarding normalization, we also add a normalization layer to each artificial neu-

ral network with the mean and standard deviation of each feature being the one
of the whole training set composed of the values of all locations.
We use the same number of neurons in the hidden layer as in the location-wise
approach.
4.1.3 Single entity approach
For metric constructions reasons, the single entity approach is slightly different
than the one followed in Chapter 3. Indeed, what we consider as a single entity
28
here is only the input. The information of all locations is used as input, but the
output is only the 2 features of a specific location. Hence, we need to train differ-
ent models for each location, using each time the input of all locations. Therefore,
the window dataset is built similarly to the location-wise approach, except that
the input has now 96 features.
Like for the location-wise approach, we also add a normalization layer to each
artificial neural network with the mean and standard deviation of each feature
being the one of the training set.
Since the number of samples is reduced compared to the single distribution ap-
proach, and the risk of overfitting augmented with the increase of input features,
the deep learning models used have to have fewer neurons in their hidden layer.
For one-step-ahead prediction, there are 64 neurons, and for 72/144 steps ahead
there are 16/8 neurons in the hidden layer.
4.2 Performance analysis
Approach Model MSDE MDE MAE speed MAE direction

Last 17.001 3.236 1.926 0.247
Repeat 27.974 4.359 2.521 0.724
Location-wise LSTM 18.480 3.424 2.273 0.436
Location-wise CNN 15.017 3.064 2.098 0.287
Single distribution LSTM 15.100 3.070 2.128 0.284
Single distribution CNN 15.063 3.081 2.194 0.260
Single entity LSTM 23.207 3.918 2.633 0.730
Single entity CNN 17.145 3.345 2.119 0.581
Table 4.1: 12 hours forecasts performance
Table 4.1 displays the error metrics on the test sets, rounded to the third decimal,
obtained by each deep learning model for each approach followed. First of all,
we have multiplied the number of forecasting steps by 72 since the single-step
predictions. However, the error metrics increase by a factor between 2 and 5 de-
pending on the models and the metric. Now, regarding 12 hours predictions only,
the important gap of performance values between the last and repeat baselines
could suggest that the last value of the time series itself gives more information
on the future values 12 hours ahead than the 12 past hours. This gap is especially
significant for the direction error. The last baseline has also the best performances
regarding the MAE of speed and direction. However, 3 of the deep learning mod-
els were able to best the last baseline for the MSDE and MDE: LSTM network
29
and CNN for the single distribution approach and CNN for the location-wise ap-
proach. The metrics for location-wise and single distribution strategies are close
for the CNN model. Nevertheless, this is not the case for LSTM networks which
have better performance in the single distribution approach. Once again, the sin-
gle entity has the worst performance results, even if the MAE of the speed of the
CNN model is encouraging. For 12 hours predictions, which corresponds to 72
steps, the CNN model appears to be a suitable model to use against the LSTM
network.
Approach Model MSDE MDE MAE speed MAE direction

Last 25.202 3.946 2.322 0.250
Repeat 37.655 4.822 2.823 0.723
Location-wise LSTM 24.126 3.921 2.667 0.522
Location-wise CNN 21.384 3.657 2.547 0.309
Single distribution LSTM 21.302 3.659 2.565 0.308
Single distribution CNN 21.495 3.692 2.700 0.274
Single entity LSTM 27.688 4.273 2.951 0.747
Single entity CNN 23.161 3.864 2.522 0.654
Table 4.2: 24 hours forecasts performance
For 144 steps forecasting (see Table 4.2), the proportional gap between last and
repeat baselines has been reduced, especially in terms of the wind speed MAE.
Even if the number of forecasting steps was doubled, the error metrics have only
been increased by a factor much smaller than 0.5 for most of them. The results
are consistent with the previous ones obtained for 12 hours in terms of ranking of
approaches and artificial neural networks. The last baseline has the best perfor-
mances regarding the MAE of speed and direction. However, like for the 12 hours
predictions framework, the same three deep learning models were able to have
MSDE and MDE below the last baseline benchmark. Between all deep learning
models, single entity CNN had the best results in terms of the MAE of speed. This
is encouraging since its MSDE and MDE results are also close to the top 3 models.
The single distribution approach has shown better results both for 12 hours and
24 hours forecasts. This could be due to a lack of data per location for both other
alternatives when using a large windows dataset like here. Given the performance
of the CNN in the location-wise approach, we would suggest using this model as
the first choice when using a larger dataset than ours.
30
Figures 4.1, 4.2, 4.3 and 4.4 show 3 different single-batches of the cosine speed fea-
ture of a window dataset constructed for 12 hours forecasting for a given location.
We can compare the performance for this specific batch between the location-wise
CNN and both last and repeat baselines. For the 3 windows, the CNN model
seems to have learned to use the previous values to predict the pattern of future
values proficiently. Similarly, Figures 4.5, 4.6, 4.7 and 4.8 show 3 different single-
batches of the cosine speed feature of a window dataset constructed for 24 hours
forecasting for a given location. The same interpretation as for 12 hours predic-
tions can be made for the CNN location-wise model here. It is also noticeable
that the repeat baseline forecasts are either very close (second and third window)
or are completely distant (first window) from the true values.
15.0 Inputs
Labels
12.5
10.0
7.5
5.0
0 5 10 15 20 25
5
0
5
10
15
0 5 10 15 20 25
14
cosine-speed (in m/s)
12
10
8
6
4
0 5 10 15 20 25
Time (in hours)
Figure 4.1: Inputs and labels - 12 hours forecasting
31
15.0 Inputs
Labels
12.5 Predictions
10.0
7.5
5.0
0 5 10 15 20 25
5
0
5
10
15
0 5 10 15 20 25
14
12
10
8
6
4
0 5 10 15 20 25
Time (in hours)
Figure 4.2: CNN location-wise 12 hours forecasts
15.0 Inputs
Labels
12.5 Predictions
10.0
7.5
5.0
0 5 10 15 20 25
5
0
5
10
15
0 5 10 15 20 25
14
12
10
8
6
4
0 5 10 15 20 25
Time (in hours)
Figure 4.3: Last baseline 12 hours forecasts
32
15.0 Inputs
Labels
12.5 Predictions
10.0
7.5
5.0
0 5 10 15 20 25
5
0
5
10
15
0 5 10 15 20 25
14
12
10
8
6
4
0 5 10 15 20 25
Time (in hours)
Figure 4.4: Repeat baseline 12 hours forecasts
0
2
4
6
Inputs
8 Labels
0 10 20 30 40 50
2
0
2
4
6
0 10 20 30 40 50
7.5
5.0
2.5
0.0
2.5
0 10 20 30 40 50
Time (in hours)
Figure 4.5: Inputs and labels - 24 hours forecasting
33
0
2
4
6 Inputs
Labels
8 Predictions
0 10 20 30 40 50
2
0
2
4
6
0 10 20 30 40 50
7.5
5.0
2.5
0.0
2.5
0 10 20 30 40 50
Time (in hours)
Figure 4.6: CNN location-wise 24 hours forecasts
0
2
4
6 Inputs
Labels
8 Predictions
0 10 20 30 40 50
2
0
2
4
6
0 10 20 30 40 50
7.5
5.0
2.5
0.0
2.5
0 10 20 30 40 50
Time (in hours)
Figure 4.7: Last baseline 24 hours forecasts
34
0
2
4
6 Inputs
Labels
8 Predictions
0 10 20 30 40 50
2
0
2
4
6
0 10 20 30 40 50
7.5
5.0
2.5
0.0
2.5
0 10 20 30 40 50
Time (in hours)
Figure 4.8: Repeat baseline 24 hours forecasts
4.3 Discussion
With enough computational resources, the metrics could be computed for differ-
ent values of batch size and the number of neurons in the hidden layer. This could
confirm the performance, or even maybe enhance it for some models. Moreover,
an intrinsic modification on the artificial neural networks could be done: adding a
dropout layer in the network, to prevent overfitting. The role of the dropout layer
is to set randomly inputs units to 0 based on a given frequency. This option could
be particularly useful in the case of the single entity approach. Instead of using
single shots models and predicting the entire output sequence in a single step, it
may be helpful for the model to decompose this prediction into individual time
steps. Then, each model’s output can be fed back into itself at each step, and
predictions can be made conditioned on the previous one. We call these multi-step
deep learning models, auto-regressive deep learning models.
There exists an infinite number of artificial neural networks more or less adapted
to forecasting multivariate times series. CNN and LSTM networks are common
artificial neural networks, more complex ANN could be more appropriate to the
multi-locations purpose. For instance, Barbounis 2007 [16] focused their research
on a locally feedback dynamic fuzzy neural network trained on data from 3 sta-
tions. One might adapt this network for the framework presented in this the-
35
sis. Zhu 2018 [17] also proposed a model for wind speed prediction with spatio-
temporal correlation: the predictive deep convolutional neural network (PDCNN).
Unfortunately, for predictions up to 24 hours, our dataset might be too small.

Indeed it is composed of only 105 days, which is not much even for 48 different
locations. For 24 hours, for one location, the windows dataset is composed of con-
secutive windows of size 2×144 = 288 since we use the past 24 hours to predict the
next 24 hours. There are 15 624 values for both features and for every 48 locations
hence we have only 54 windows available for the 24 hours predictions. It would
be interesting to benchmark the performance of similar models for a larger dataset.
36
Concluding remarks
In this study, multiple models for forecasting wind velocity vectors were proposed
for single and multi-step predictions in multiple locations. They were evaluated
using at least three metrics: the MDE of the velocity vector, the MAE of speed,
and the MAE of direction. By analyzing the forecasting results, we can summa-
rize the following. None of the models, in both frameworks, were able to have
performance metrics significantly better than the last baseline. Nevertheless, in
terms of MDE, we suggest using an ARMA model per feature for short-term pre-
dictions. For multi-step forecasting, the more appropriate deep learning model
to use is a CNN following either the location-wise or the single distribution ap-
proach, depending on the amount of data available. The intrinsic spatial relation
of the dataset was either not enough significant or the models introduced were
not able to learn from it. Based on the results, we are still convinced that with
a larger dataset and more time to investigate more models, the single entity ap-
proach should enhance the results over the location-wise approach. Those findings
could be significant for the research on wind velocity vector forecasting and could
stimulate follow-up studies in the future.
37
Bibliography
[1] W. Chang, “Short-term wind power forecasting using epso based hybrid
method. energies, 6, 4879-4896,” 2013.
[2] W.-Y. Chang et al., “A literature review of wind forecasting methods,” Jour-
nal of Power and Energy Engineering, vol. 2, no. 04, p. 161, 2014.
[3] J. Palomares-Salas, J. De La Rosa, J. Ramiro, J. Melgar, A. Aguera, and

A. Moreno, “Arima vs. neural networks for wind speed forecasting,” in 2009
IEEE International Conference on Computational Intelligence for Measure-
ment Systems and Applications. IEEE, 2009, pp. 129–133.
[4] E. Erdem and J. Shi, “Arma based approaches for forecasting the tuple of
wind speed and direction,” Applied Energy, vol. 88, no. 4, pp. 1405–1414,
2011.
[5] M. Bilgili, B. Sahin, and A. Yasar, “Application of artificial neural networks

for the wind speed prediction of target station using reference stations data,”
Renewable Energy, vol. 32, no. 14, pp. 2350–2360, 2007.
[6] R. Mushtaq, “Augmented dickey fuller test,” 2011.
[7] C. Hiemstra and J. D. Jones, “Testing for linear and nonlinear granger causal-
ity in the stock price-volume relation,” The Journal of Finance, vol. 49, no. 5,
pp. 1639–1664, 1994.
[8] D. C. Montgomery, C. L. Jennings, and M. Kulahci, Introduction to time

series analysis and forecasting. John Wiley & Sons, 2015.
[9] R. Bellman, “Dynamic programming,” Science, vol. 153, no. 3731, pp. 34–37,
1966.
[10] A. R. Liddle, “Information criteria for astrophysical model selection,” Monthly

Notices of the Royal Astronomical Society: Letters, vol. 377, no. 1, pp. L74–
L78, 2007.
[11] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,”

arXiv preprint arXiv:1511.08458, 2015.
38
[12] Y. Yu, X. Si, C. Hu, and J. Zhang, “A review of recurrent neural networks:
Lstm cells and network architectures,” Neural computation, vol. 31, no. 7, pp.
1235–1270, 2019.
[13] W. Nicholson, D. Matteson, and J. Bien, “Bigvar: Tools for modeling sparse
high-dimensional multivariate time series,” arXiv preprint arXiv:1702.07094,
2017.
[14] N. I. Fisher, Statistical analysis of circular data. cambridge university press,

1995.
[15] M. T. Hagan, H. B. Demuth, and M. Beale, Neural network design. PWS

Publishing Co., 1997.
[16] T. Barbounis and J. B. Theocharis, “A locally recurrent fuzzy neural network

with application to the wind speed prediction using spatial correlation,” Neu-
rocomputing, vol. 70, no. 7-9, pp. 1525–1542, 2007.
[17] Q. Zhu, J. Chen, L. Zhu, X. Duan, and Y. Liu, “Wind speed prediction with
spatio–temporal correlation: A deep learning approach,” Energies, vol. 11,
no. 4, p. 705, 2018.
[18] M. Lei, L. Shiyan, J. Chuanwen, L. Hongling, and Z. Yan, “A review on the

forecasting of wind speed and generated power,” Renewable and sustainable
energy reviews, vol. 13, no. 4, pp. 915–920, 2009.
39

Master Thesis Jimmy

Uploaded by

Copyright:

Available Formats

Master Thesis Jimmy

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Master Thesis Jimmy

Uploaded by

Copyright:

Available Formats

Forecasting wind velocity vector for

École Polytechnique Fédérale de Lausanne

Under the supervision of

Prof. Petros Dellaportas

University College of London - Athens University of Economics and Business -

1.1 Time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 (V)ARMA models . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 Data source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Data filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Statistical description . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1.1 ARMA models . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1.2 VAR models . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.1.3 Penalized VAR models . . . . . . . . . . . . . . . . . . . . 19

3.1.4 Deep learning models . . . . . . . . . . . . . . . . . . . . . 20

3.2 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Multi-step deep learning models 26

4.1.1 Location-wise approach . . . . . . . . . . . . . . . . . . . . 28

4.1.2 Single distribution approach . . . . . . . . . . . . . . . . . 28

4.1.3 Single entity approach . . . . . . . . . . . . . . . . . . . . 28

4.2 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . 29

1.1 Time series

1.2 (V)ARMA models

Definition 1.5. An Auto-Regressive Moving Average Process with parameters

BIC = −2lnLmax + k lnN

Definition 1.10. The number of epochs is a hyper-parameter that defines the

Definition 1.12. Convolutional neural networks (CNN) are used in applications

Figure 2.1: Map with location of wind turbines in the dataset

Different information on the wind was given to us by the International Wind

2.2 Data filtering

Unfortunately, most of the data is missing or incoherent. Hence, we have to filter

2.3 Data pre-processing

2.4 Statistical description

Figure 2.2: Heat map distribution of the dataset

Figure 2.3: Wind speed density plot

2.5 Performance measures

Figure 2.4: Wind direction density plot

Similarly, we define the Mean Squared Distance Error (MSDE) as:

3.1.1 ARMA models

Partial Autocorrelation Partial Autocorrelation

(a) Cosine speed (b) Sinus speed

Figure 3.1: PACF for one location

(a) Cosine speed (b) Sinus speed

Figure 3.2: ACF for one location

3.1.2 VAR models

p AIC BIC p AIC BIC

Location-wise Single entity

Table 3.3: VAR Order Selection

3.1.3 Penalized VAR models

Nicholson 2017 [13] have implemented a package on R that includes regularization

To estimate the optimal penalty parameter λ, cross-validation is conducted in

3.2 Performance analysis

Table 3.4: Short-term forecasts performance by model

Overall, it is complicated to give a relevant conclusion on which model is the most

00:00 02:00 04:00 06:00 08:00 10:00

00:00 02:00 04:00 06:00 08:00 10:00

Figure 3.3: Forecasts against actuals for short-term prediction

00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00