Memoria

Author: Manel Maragall Cambra
Supervisor: Jordi Vitrià
September 2, 2018

UNIVERSITAT DE BARCELONA
Facultat de Matemàtiques i Informàtica
MSc

Abstract
One of the main concerns of the manufacturing industry is the constant threat of unplanned stops. Even if the maintenance guidelines are followed for all the components of the line, these downtimes are common and they affect productivity. Most of what is done nowadays in manufacturing plants involves classical statistics, and sometimes online monitoring. However, in most industries the data related to the process is monitored and saved for regulatory purposes. Unfortunately it is barely used, while current technologies offer a wide horizon of possibilities.
The time to an event is a primary outcome of interest in many fields, e.g., medical research, customer churn, etc., and we think that it is also very interesting for Predictive Maintenance. The time to an event (or, in this context, the time to failure) is typically positively skewed, subject to censoring, and explained by time-varying variables. Therefore conventional statistical learning techniques such as linear regression or random forests don't apply. Instead we have to rely on more complex methods.
Acknowledgements
First of all, I would like to thank Professor Jordi Vitrià for finding the time to be my project advisor. His remarks and suggestions have been of great importance in guiding the course of this thesis. I would also like to express my gratitude to Egil Martinsson, for not only having a great idea but also being patient and accessible to my constant questions.
I would also like to thank Llorenç Domingo and my other colleagues from Bigfinite for their support, and for offering me the flexibility I needed to attend the program during these last two years.
I'm also very grateful to my parents Cristina and Manel; they have always been there when I needed a boost. I know reading through this text has not been easy.
Finally, I'll always owe one to my partner Belén, who has made sure that I get ahead with the thesis and the whole MSc. Thank you very much!
Contents

Abstract
Acknowledgements
1 Introduction
    1.1 Censoring
    1.2 The Weibull Distribution
    1.3 Recurrent Neural Networks
        1.3.1 Long Short Term Memory Networks
        1.3.2 Gated Recurrent Unit
2 Motivation and Goals
3 State of the Art
4 Methodology
    4.1 Data set
    4.2 Validation Process
    4.3 Software
5 Experimentation
    5.1 Data preparation
    5.2 Baseline
        5.2.1 Regularization
        5.2.2 Results
    5.3 Adapting to WTTE-RNN
        5.3.1 Results
        5.3.2 GRU variant
6 Discussion
    6.1 Future Work
    6.2 Conclusions
Bibliography
Chapter 1
Introduction
There are a few concepts that are relevant to comprehend the work that has been done in this project. These are mostly statistical notions that will be addressed in order of complexity, aiming for a simple and schematic overview, deepening into the theoretical basis of some of these concepts only when needed.
Since we will exclusively focus on the branch of Recurrent Neural Networks (RNN), I suggest a chronological approach (Nielsen, 2015) for readers who are not familiar with Deep Learning. In Nielsen's blog, the basic units that make up a Multi-Layer Perceptron are introduced in an understandable way.
1.1 Censoring
Within statistics, censoring is defined as a condition in which the value of a measurement or observation is only partially known. This turns out to be a common phenomenon in many fields, such as Health, Life Sciences, Engineering, etc. And of course, when studying the waiting time to an event, it is possible, and even likely, that such an event has not been observed from start to end.
1.2 The Weibull Distribution
Besides an infinite spike or an infinitely flat probability density function f, the Weibull Distribution can relate to many other statistical distributions (Equation 1.1). When modelling the time to failure, it provides a distribution for which the failure rate is proportional to a power of time.
f(x) = (α/β) (x/β)^(α−1) e^(−(x/β)^α)   for x ≥ 0,   and   f(x) = 0   for x < 0   (1.1)
The hazard function is also known as the hazard rate, the failure rate in the field of reliability engineering, and the force of mortality µ in demographics. An extension of this idea is the accumulation of the hazard over time, a.k.a. the cumulative hazard function Λ.
Λ(t) = ∫₀ᵗ λ(u) du = −ln S(t) = (t/β)^α   (1.5)
As can be seen, we can express the probability density function f and the cumulative distribution function F of the Weibull Distribution through these concepts.
f(t) = λ(t) · S(t) = λ(t) · e^(−Λ(t))   (1.6)
F(t) = 1 − S(t) = 1 − e^(−Λ(t))   (1.7)
Additionally, the cdf of the Weibull Distribution is invertible² and therefore the quantile³ has a closed form.
Q(p) = β (−ln(1 − p))^(1/α)   (1.8)
Finally, these are the forms for the mean, variance, mode and median.
E[X] = β Γ(1 + 1/α)   (1.9)
Var(X) = β² [Γ(1 + 2/α) − Γ(1 + 1/α)²]   (1.10)
Mode = β ((α − 1)/α)^(1/α) if α > 1, and 0 if α ≤ 1   (1.11)
Median = β (ln 2)^(1/α)   (1.12)
2 A function g is the inverse of f if, whenever f applied to an input x gives a result of y, applying g to y gives the result x, and vice versa, i.e., f(x) = y ⟺ g(y) = x.
3 The quantile can be used to sample the distribution by taking uniform samples of a number u ∈ (0, 1) and computing Q(u) (inverse transform sampling).
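As an illustration of this sampling footnote, here is a minimal Python sketch that draws Weibull samples through the closed-form quantile of Equation 1.8; the parameter values are arbitrary choices:

    import numpy as np

    def sample_weibull(alpha, beta, size=1000, seed=0):
        """Inverse transform sampling: Q(u) = beta * (-ln(1 - u))**(1/alpha)."""
        rng = np.random.default_rng(seed)
        u = rng.uniform(0.0, 1.0, size)
        return beta * (-np.log(1.0 - u)) ** (1.0 / alpha)

    samples = sample_weibull(alpha=2.0, beta=100.0)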
1.3 Recurrent Neural Networks
In simple terms, the differentiating property that RNNs incorporate, and traditional networks don't, is loops. In fact, these loops within them keep the short term memory flowing. Christopher Olah (Understanding LSTM Networks) provides an excellent explanation of these ideas; this chapter borrows heavily from his blog and so do the figures (1.4, 1.5, 1.6).
1.3.1 Long Short Term Memory Networks
Each of the modules, sometimes referred to as cells, looks something like Figure 1.5, and there is a lot going on inside. Again, you can find a detailed explanation in the blog. In any case, I will try to provide a basic overview. Given a moment of time t, representing the t-th event in the sequence, Ct is the horizontal line running through the top of the diagram. Along this line it is easy for the information to flow unchanged, but it can also be altered by the gates. The first gate from the left, ft, is the forget layer, and it decides how much of the information coming from the previous cell will persist. On the other hand, the product of C̃t and it takes care of the new information. Basically, these are new candidate values scaled by how much we want to update each state value. Thus:
Ct = ft · Ct−1 + it · C̃t   (1.13)
The first part, ft · Ct−1, of Equation 1.13 gets rid of previous information, and it · C̃t adds new values. Finally, we decide what we are going to output, ht, based on a filtered version of the cell state:
ht = σt · tanh(Ct)   (1.14)
Notice that the inner state Ct, the output mask σt and the output state ht should have the same dimension. The number of units of these layers depends on the complexity of the problem that we are trying to model. Therefore the selection of these units, along with the number of past events to observe, are very important hyperparameters for the model.
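To make Equations 1.13 and 1.14 concrete, here is a minimal NumPy sketch of a single LSTM cell step; the weight names and shapes are illustrative assumptions, not the exact parametrization of any library:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, C_prev, W, U, b):
        """One LSTM time step. W, U, b are dicts holding the parameters of the
        forget (f), input (i), candidate (c) and output (o) gates."""
        z = {g: W[g] @ x_t + U[g] @ h_prev + b[g] for g in ("f", "i", "c", "o")}
        f_t = sigmoid(z["f"])                 # forget gate
        i_t = sigmoid(z["i"])                 # input gate
        C_tilde = np.tanh(z["c"])             # new candidate values
        C_t = f_t * C_prev + i_t * C_tilde    # Equation 1.13
        o_t = sigmoid(z["o"])                 # output mask (σt in the text)
        h_t = o_t * np.tanh(C_t)              # Equation 1.14
        return h_t, C_t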
Although LSTM and GRU cells have achieved outstanding results, it is reasonable to think that there exist alternative architectures that could obtain better results. In fact, some have ventured into finding new proposals (Jozefowicz, Zaremba, and Sutskever, 2015) by evaluating over 10,000 RNN architectures.
Chapter 2
Motivation and Goals
Until recently, an issue in pharma meant throwing the cake in the trash. This can be measured in the order of millions of dollars in some cases. Fortunately, the Food and Drug Administration (FDA) and the other agencies are starting to allow "science" during the process. This also allows for a wide variety of mechanisms to provide real-time information about the process and, more importantly, for decisions to be made.
Quality = ∑GoodUnits / TotalUnits,   where TotalUnits = ∑GoodUnits + ∑Scraps   (2.1)
Availability = 1 − ∑NonPlannedStops / PlannedProductionTime   (2.2)
Performance = 1 − (∑SmallStops + ∑ReducedSpeedLoss) / PlannedProductionTime   (2.3)
As we can see in Equation 2.1, the OEE is greatly affected by the scraps, which are the rejected units. Scraps are typically related to unplanned stops in the line. Knowing in advance when there is going to be a failure (the time to failure) can alert the line operators so they prevent it, or at least prepare a response. This of course may require more context, for instance which components are going to fail, or even what caused the failure. In any case, predicting the failures can be very interesting for this classical industry that is just starting to grasp the potential of Machine Learning.
In the last decade, the availability of industrial internet of things (IIoT) devices has made it possible to monitor the machines continuously with wireless sensors, in order to assess the degradation of the components and predict the failures ahead of time. These new ways to monitor the equipment condition continuously (Online Monitoring) bring new lines of research to the manufacturing industries (Amruthnath and Gupta, 2018) that most likely will redefine how things are done.
The goal therefore is to model the time to failure in order to address the needs of Predictive Maintenance. These are the desirable properties of the model:
• Confidence: Avoid costs from false positives by somehow measuring the uncertainty of the prediction.
Chapter 3
State of the Art
Time To Failure (TTF) has been tackled in many ways over the years. I present here a short selection of techniques that approach the problem of Predictive Maintenance from different perspectives. We will see that the methods that take into consideration the possibility of having censored data (chapter 1.1) usually involve some cumbersome workarounds. Lastly we will focus on the Weibull Time To Event RNN (Martinsson, 2016). It will be explained in chapter 3.4, but it basically approaches TTF in a very natural, powerful and flexible way.
The control chart (Figure 3.1) shows the value of the quality characteristic versus the number of produced units, or sometimes versus the time. In general, the chart contains a center line that represents the mean value for the in-control process.
The Cox Proportional Hazards model takes the form λ(t | x) = λ₀(t) · e^(β₁x₁ + ··· + βₚxₚ), where λ₀(t) is the baseline hazard function, which can be regarded as the hazard function of an individual whose covariates all have a value of zero. The parameters of the Cox model can be interpreted in the following way:
• e^(βj) represents the hazard ratio for a one unit increase in xj, with all other covariates held constant.
• βj < 0 means that if xj increases, the risk (hazard) decreases.
• βj > 0 means that if xj increases, the risk (hazard) increases.
However, Cox also noted that the interpretation of the proportional hazards assumption can be quite tricky (Reid, 1994). An alternative to the Proportional Hazards models is the AFT model, which instead assumes that the effect of a covariate is to accelerate or decelerate the life course of a failure by some constant. This appears to be more suitable for mechanical processes.
The binary variant of the sliding window approach has been used before for customer churn (XIA and JIN, 2008). It is useful because of its simplicity; formulations like "the customer will stay with us in the next period of duration d" are straightforward. In other fields this idea can be adapted into a multi-class classification problem or even a regression, yet the inference is still limited to predicting one time window ahead. Forward-looking constructions are especially troublesome. A minimal sketch of the binary labelling is shown below.
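This hypothetical snippet builds binary labels of the form "the unit fails within the next w cycles" from a time-to-failure column; the names and the window size are assumptions for illustration:

    import numpy as np

    def binary_window_labels(ttf, w=30):
        """ttf: array of remaining cycles to failure, one entry per cycle.
        Returns 1 when the failure happens within the next w cycles."""
        return (np.asarray(ttf) <= w).astype(int)

    labels = binary_window_labels([120, 80, 31, 30, 5, 0], w=30)  # -> [0, 0, 0, 1, 1, 1]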
The modelling of censored data could be done with a wide variety of distribu-
tions: Beta, Gamma, Exponential, Poisson, etc. But Martinsson’s work focuses on
the Weibull Distribution because it is:
• Empirically feasible
• Easily discretized
• Regularizable
• Numerically stable
Although the Weibull Distribution has some great properties, the framework can be extended to support other distributions and adapted for multivariate prediction. In fact, Martinsson also examines the behaviour of the Weibull Distribution with censored data. In Figure 3.3, the event is right censored, so we know it is going to happen after time t. Notice that the measure of uncertainty that we get from a statistical distribution is a very useful property for the Predictive Maintenance use-case, especially in the context of the pharmaceutical industry, since it is highly regulated. Actions on the line, besides having a cost, have to be justified. Hence it is very important to know when the line is going to fail, and how sure we are about that.
The parameters of the distribution are fitted by maximizing the likelihood.
Finding the maximum of a function often involves taking the derivative of the function and solving for the parameter being maximized. Although the differentiation will be handled by the neural network, it is often easier when the function being maximized is the natural logarithm (ln) of the likelihood (the log-likelihood). This is because the likelihood function of a collection of statistically independent observations factors into a product of individual likelihood functions. The logarithm of this product is a sum of individual logarithms, and the derivative of a sum of terms is often easier to compute than the derivative of a product.
Since the Time To Failure can only be right censored, we will avoid specifying how the loss would look for left censored data. Assume that our statistical model f(·; θ) is the Weibull Distribution and therefore θ = (α, β). Let (t, u) be an observation with u the failure indicator, such that u = 1 means that we have an uncensored observation and u = 0 a right censored observation, where:
• d(t) = Λ(t + 1) − Λ(t)
The unconstrained optimization problem in the discrete case is then to find w maximizing the log-likelihood:

maximize over w:   ln(L(w, y, u, x)) := ∑_{t=0}^{T} ( u_t · ln[ exp[ ((y_t + 1)/β_t)^{α_t} − (y_t/β_t)^{α_t} ] − 1 ] − ((y_t + 1)/β_t)^{α_t} )   (3.18)
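A minimal sketch of Equation 3.18 as a Keras loss function is shown below (negated, since Keras minimizes); the tensor layout, the epsilon and the absence of clipping are assumptions of this sketch, and Martinsson's wtte package offers a production-ready implementation:

    import tensorflow.keras.backend as K

    def discrete_weibull_loglik_loss(y_true, y_pred):
        """Negative discrete Weibull log-likelihood (Equation 3.18).
        y_true[..., 0] = observed time y_t, y_true[..., 1] = failure indicator u_t;
        y_pred[..., 0] = alpha_t (shape), y_pred[..., 1] = beta_t (scale)."""
        y, u = y_true[..., 0], y_true[..., 1]
        a, b = y_pred[..., 0], y_pred[..., 1]
        hazard0 = K.pow((y + 1e-9) / b, a)    # Lambda(y_t) = (y_t / beta_t)^alpha_t
        hazard1 = K.pow((y + 1.0) / b, a)     # Lambda(y_t + 1)
        loglik = u * K.log(K.exp(hazard1 - hazard0) - 1.0) - hazard1
        return -K.mean(loglik)                # minimize the negative log-likelihood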
Chapter 4
Methodology
4.1 Data set
In the end, I decided to focus on the Turbofan Engine Degradation Simulation Data Set (Saxena and Goebel, 2008). Although this is a simulated dataset, it comes from one of the simulators at NASA Ames, CA. In particular, this simulator is called C-MAPSS, which stands for Commercial Modular Aero-Propulsion System Simulation, and it is a tool for the simulation of realistic large commercial turbofan engine data.
There is a paper about the generation of the data set (Saxena et al., 2008); basically, it consists of multiple multivariate time series. Each time series is from a different engine, i.e., the data can be considered to be from a fleet of engines of the same type. Each engine starts with a different degree of initial wear and manufacturing variation which is unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition. There are three operational settings that have a substantial effect on engine performance. These settings are also included in the data. The data is contaminated with sensor noise.
The data is provided as a zip-compressed text file with 26 columns of numbers, sep-
arated by spaces. Each row is a snapshot of data taken during a single operational
cycle, each column is a different variable. The columns correspond to:
• Unit number
• Time (cycles)
• Operational settings 1–3
• Sensor measurements 1–21
The data is partitioned into four subsets: FD001 with 1 operating condition and 1 failure mode, FD002 with 6 operating conditions and 1 failure mode, FD003 with 1 operating condition and 2 failure modes, and FD004 with 6 operating conditions and 2 failure modes. The 2 failure modes correspond to fan degradation and high-pressure compressor degradation (see Figure 4.1), whereas the 6 operating conditions are a combination of altitude, flight speed and TRA.
The engine is operating normally at the start of each time series, and develops a fault at some point during the series. If we observe Figure 4.2, we can see that engine 1 was monitored until cycle 192, when the failure occurred. The objective is to predict, at each cycle, the number of remaining operational cycles before the failure, e.g., we would like to know at cycle 172 that there are 20 cycles left to the failure.
4.2 Validation Process
Since the sequences of the test set are right censored and are not observed until the end, we will additionally separate 20% of the original train data set as a validation data set. This corresponds to 20 batches of approximately 200 cycles each, since there are 100 engines in the training set and 100 more in the test set. The split will be done with a random seed (42) for reproducibility purposes. We call a batch the sequence of observed cycles per engine; a sketch of this split is shown below.
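A minimal sketch of this engine-level split, assuming the data lives in a pandas DataFrame with a unit number column (the column name is an assumption):

    import numpy as np
    import pandas as pd

    def split_by_engine(df, val_fraction=0.2, seed=42):
        """Hold out a fraction of the engines (not of the rows) for validation."""
        engines = df["unit_number"].unique()
        rng = np.random.default_rng(seed)
        val_engines = rng.choice(engines, size=int(len(engines) * val_fraction),
                                 replace=False)
        val_mask = df["unit_number"].isin(val_engines)
        return df[~val_mask], df[val_mask]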
Besides the loss function employed in each experiment, there will be a set of metrics common to every trial. Notice that the models based on the WTTE-RNN architecture (chapter 3.4) predict the 2 parameters of a Weibull Distribution (chapter 1.2). With this statistical model the expected value of the number of remaining cycles to the failure, formally called the Remaining Useful Life (RUL), can be estimated in different ways (next chapter). Therefore some models will be evaluated several times; this will be useful to additionally compare these methods of computing the RUL.
RMSE = √( (1/n) ∑_{i=1}^{n} (y_i − ỹ_i)² )   (4.1)
MAE = (1/n) ∑_{i=1}^{n} |y_i − ỹ_i|   (4.2)
The Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) are standard metrics for regressions. We refer to the difference between the observed values y and the predicted values ỹ as the error.
In the paper (Saxena et al., 2008) the authors introduce a score s where the penalty grows exponentially with increasing error. In this scoring technique late predictions are therefore more heavily penalized than early predictions, which is an interesting feature for Predictive Maintenance. Even so, it is difficult to interpret since it is likely to produce very high values for relatively small errors. Thus it will not be included in the evaluation; instead our aim is to include RMSE, MAE and R² as metrics for the models.
4.3 Software
Deep Learning is receiving a lot of attention these days and consequently there are many tools in continuous development. It is hard to keep up with the good work that is being done, and to explore all the frameworks that are popular among developers. Instead, we will rely on Google's project TensorFlow™ and its API Keras.
In Figure 4.3 we have a list of Deep Learning projects from Github, updated in July 2018. As we can see, TensorFlow™ is the most popular, since besides being an open source software library it is conceived for high performance numerical computation. Additionally, its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs). On the other hand, Keras is a high-level neural networks API written in Python, capable of running on top of TensorFlow, CNTK, or Theano. More importantly, it is focused on enabling fast experimentation.
Chapter 5
Experimentation
Finally we present all the experimentation that has been done in this thesis. Even though the last sections were quite theoretical, this chapter will be presented from a practical perspective, not only showing the results obtained but also focusing on the relevant parts of the code implementation. That is why all the code will be posted in my personal Github account, complemented with some explanations and visualizations thanks to Jupyter Notebook. Thus, I strongly recommend visiting that resource if one wants a hands-on approach; the easiest way to start playing with the notebook is by opening it with Collaboratory.
Just to recapitulate, the scope of this thesis is to use Recurrent Neural Networks (1.3) in order to predict the time to an event, which, framed in the context of my recent professional experience, translates into predicting the time to failure. In particular we want to target the WTTE-RNN (3.4) architecture proposed by Martinsson from a realistic use-case scenario, validating the theoretical work of his thesis and exploiting the benefits of the Weibull Distribution (1.2) for Predictive Maintenance (2.2).
5.1 Data preparation
Even with these constraints on the purpose of the model, there are a few ways to prepare the data and many alternatives when implementing the Recurrent Neural Network. If we imagine the data set as a 3D tensor, we distinguish two main approaches when preparing the data for this problem:
1. Rolling Window: We restrict each sample to a fixed number of past "time steps" (the look-back period) and treat each of the resulting subsequences independently, so the resulting shape is:
(total number of subsequences, look-back period length, number of features)
2. Batch Mode: In this case we don't restrict to a fixed number of past "time steps", but instead we use all the previous information of the batch. The advantage here is that the state of the RNN layer is preserved for each batch, and it is also easy to shuffle the batches in each epoch of the training. To implement this with Keras we have to apply right padding¹ to each sequence so they all have the same length (a padding sketch is shown after this list), therefore the resulting shape is:
(total number of batches, batch of maximum length, number of features)
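A minimal sketch of this padding step with Keras, assuming each engine's cycles are collected in a list of (cycles, features) arrays:

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # sequences: list of arrays, one per engine, each of shape (n_cycles, n_features).
    # Right padding ("post") with zeros gives a tensor of shape
    # (total number of batches, batch of maximum length, number of features).
    padded = pad_sequences(sequences, padding="post", dtype="float32", value=0.0)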
Among the possible ways to implement an RNN that we observe in Figure 5.1, we highlight many-to-one and many-to-many. The main difference is that in many-to-one we exclusively return the output state of the last unit ht, whereas in many-to-many we return the states of all the units ht, ht−1, ht−2, ..., ht−n.
Because in the Rolling Window case we treat each sequence of look-back period length independently, many-to-one is the inherent implementation. In Keras there is a workaround to try to preserve the state of the RNN unit between sequences that belong to the same batch: an infamous API called stateful that is not trivial to use at all. We will not present the work done with this type of data reshaping; others (@daynebatten, @gm-spacagna) have already made the effort.
There are ways to implement many-to-one for the Batch Mode, but these are less intuitive and more prone to errors. Instead we will focus on the many-to-many implementation for the Batch Mode, which feels like the more natural option for the CMAPSS data set. The Keras implementation of this mode involves masks, sample weights and other techniques that will be explained next.
1 In Machine Learning, padding is a technique to extend a sequence to a new, longer desired length.
5.2 Baseline
We establish a baseline model that uses Recurrent Neural Networks to predict the
Remaining Useful Life. The objective is to select a model that obtains reasonable
results with the validation process that has been established. Then we will adapt the
model to the WTTE-RNN architecture to compare the results.
The model is based on a popular Github repository (Griffo, 2018) and it basically consists of two stacked LSTM layers with 100 and 50 units respectively. On top there is a masking layer that will mask the padding of each engine; this mask will be propagated layer by layer and eventually applied to the loss. Therefore the padding placeholders will be skipped in the LSTM and ignored in the loss. Finally there is a time-distributed layer, which is a Keras wrapper that applies the same dense layer to every time step of the LSTM output. An important detail with the many-to-many approach is that it was necessary to use an exponential activation function e^x. A minimal sketch of this architecture is shown below.
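A sketch of the baseline just described; max_length and n_features are placeholders, and the dropout rates and the mse loss are illustrative assumptions (see the next section for the regularization actually used):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Masking, LSTM, TimeDistributed, Dense

    model = Sequential([
        # Skip the right-padded time steps; the mask propagates to the loss.
        Masking(mask_value=0.0, input_shape=(max_length, n_features)),
        LSTM(100, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
        LSTM(50, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
        # Apply the same dense layer at every time step (many-to-many); the
        # exponential activation keeps the predicted RUL strictly positive.
        TimeDistributed(Dense(1, activation="exponential")),
    ])
    model.compile(loss="mse", optimizer="rmsprop")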
5.2.1 Regularization
As can be seen in Figure 5.2, there is a large number of parameters (80k) in comparison with the number of samples (13k coming from 100 engines), so it is easy to "learn by heart" the problem. To avoid overfitting² we include two main regularization techniques: Dropout and Early Stopping.
Dropout (Srivastava et al., 2014) is a simple but very effective idea based on randomly ignoring neurons with a certain probability. It turns out that Dropout can be interpreted as learning many different neural networks, and it is well known that better results can generally be obtained when using Ensemble Learning (i.e., multiple independent models). Since each of these models overfits in a different way, overfitting is reduced by taking an average. The problem is that RNNs have connections inside the layer, and if we cancel RNN units without paying attention to the connections between these units, the noise will be amplified for long sequences, drowning the signal (Zaremba, Sutskever, and Vinyals, 2014).
Therefore we incorporate Recurrent Dropout (Gal and Ghahramani, 2016) into the baseline. The theory is quite complex and is based on interpreting dropout as a variational approximation to the posterior of a Bayesian neural network. In practice it means masking the connections between RNN units as well, in a particular way. One can get the intuition by checking Figure 5.4, where coloured connections represent dropped-out inputs, with different colours corresponding to different dropout masks.
As for Early Stopping, we simply monitor the validation loss and stop the training when we detect that it is increasing. This is typically done by setting a patience period p: if the validation loss doesn't improve on the last best score within p epochs, the execution is automatically terminated.
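In Keras this is a one-liner; a sketch with the patience of 30 epochs used later (restore_best_weights is an optional extra of this sketch, not something the text mandates):

    from tensorflow.keras.callbacks import EarlyStopping

    early_stopping = EarlyStopping(monitor="val_loss", patience=30,
                                   restore_best_weights=True)
    # Passed to model.fit(..., callbacks=[early_stopping]).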
5.2.2 Results
We trained the model with Collaboratory (GPU) for 318 epochs (14 secs/epoch) with an Early Stopping patience of 30 epochs over the validation loss. We use the RMSProp optimizer with the learning rate set to 0.001, since it is suggested for Recurrent Neural Networks in the Keras documentation. The data is scaled (min-max) and organized in batches (with batch size = 16) so the state of the RNN units is preserved; the engines are shuffled in each epoch.
In the tests done with the Rolling Window approach we obtained successful results using a linear activation function for the last dense layer of the model. However, this produces disastrous results for the Batch Mode with the many-to-many implementation, where the model fits a horizontal line across all the engines.
Just out of curiosity we implemented many-to-one for the Batch Mode, which implies removing the time-distributed dense layer with 1 neuron and substituting it with a dense layer with a number of neurons equal to the batch of maximum length. In this case the model learned the RUL slope of the average engine and produced the same output for all the engines, which seems like a case of overfitting despite the regularization techniques being used. Finally it was solved with an exponential activation function. Probably there are other alternatives that would also work; we will leave that as future work (chapter 6.1).
In short, we can observe the results of the evaluation (without the padding) in table 5.1. Also we can take a look at Figure 5.7, which plots the predicted RUL of a selection of engines from the train, validation and test sets. It seems that the baseline model is learning most of the sequences with significant detail, with a few exceptions on the test set.
One can also intuit that the model is usually more precise in the last 50 or so cycles of the sequence. In the next chapter the baseline architecture will be adapted to the WTTE-RNN model, and we will try to understand the behaviour of the Weibull Distribution across the cycles of an engine.
5.3 Adapting to WTTE-RNN
After successfully adapting the baseline to the WTTE-RNN architecture, the next question is how to compute the expected value of the RUL given the parameters α and β. The mode (Equation 1.11) seems to be the best choice, especially taking into account that the Weibull Distribution can be skewed³. Anyway, we additionally compare the prediction of the RUL with the mean (Equation 1.9) and the median (Equation 1.12); see the sketch following the footnote below.
3 In probability theory and statistics, skewness is a measure of the asymmetry of the probability
distribution of a real-valued random variable about its mean.
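A minimal sketch of these three point estimates as functions of the predicted parameters, following Equations 1.9, 1.11 and 1.12; the zero fallback for the mode when α ≤ 1 matches the behaviour described in the results below:

    import numpy as np
    from scipy.special import gamma

    def weibull_mean(alpha, beta):            # Equation 1.9
        return beta * gamma(1.0 + 1.0 / alpha)

    def weibull_median(alpha, beta):          # Equation 1.12
        return beta * np.log(2.0) ** (1.0 / alpha)

    def weibull_mode(alpha, beta):            # Equation 1.11 (0 when alpha <= 1)
        alpha, beta = np.asarray(alpha, float), np.asarray(beta, float)
        return beta * (np.maximum(alpha - 1.0, 0.0) / alpha) ** (1.0 / alpha)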
5.3.1 Results
The model was trained under the same conditions as the baseline (chapter 5.2.2). Although the Early Stopping patience was also set to 30 epochs, the training ran for 357 epochs at 12 secs/epoch. The times might be a little different because Collaboratory assigns resources depending on demand, i.e., the computing power is shared among users.
With the wtte package we set a maximum of 10 for the shape parameter α and initialize the scale parameter β around the mean of the RUL, following Martinsson's suggestion. This seems to make a significant difference in obtaining competitive results; otherwise the model converges more slowly. A sketch of such an output activation is shown below.
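A hypothetical sketch of that output activation in the notation of this thesis (α = shape, β = scale); the functional form is an assumption for illustration, and the wtte package ships an equivalent output_lambda helper:

    import tensorflow as tf

    def weibull_output(x, mean_rul=100.0, max_alpha=10.0):
        """Map the two raw network outputs to (alpha, beta).
        mean_rul is a placeholder for the mean RUL of the training set."""
        alpha = max_alpha * tf.sigmoid(x[..., 0:1])  # shape, bounded above by 10
        beta = mean_rul * tf.exp(x[..., 1:2])        # scale, initialized near the mean RUL
        return tf.concat([alpha, beta], axis=-1)

    # Used as a Keras Lambda layer on top of a Dense(2) output.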
We remove the padding to evaluate the sequences. Surprisingly the mode performs worse than the other methods, with the mean being the best expected value for the RUL in all the sets. The differences are relatively small, which suggests that the Weibull Distribution stays fairly symmetric during the engine life-cycle. Still, the mode is the value most likely to be sampled from the distribution; perhaps the reason is that the mode can produce NaN values when it approaches an infinite spike, and it is set to zero when α ≤ 1. In the end we will use the mean as the expected value of the RUL for visualizations and comparisons.
In comparison with the baseline (table 5.1), the WTTE version performs slightly better on the test set (table 5.2). Nevertheless the difference is minimal, and we can't say the model is outstandingly better by means of the evaluation metrics (chapter 4.2). Instead we have to center the attention on the attributes of the predicted Weibull Distribution. To begin with, we can examine the shape of the pdf of the Weibull Distribution as the failure becomes imminent. If we remember the theory from Martinsson's thesis (chapter 3.4.1), we expect the mass of the pdf to be pushed over the RUL as the event comes closer in time. In order to observe this phenomenon, we plot the pdf across time for a selection of engines from the train, validation and test sets.
In Figure 5.11 there are 3 randomly selected engines from each of the sets. The pdfs of the whole sequence are overlapped, ranging from blue to red through the cycles. As we explained in the previous chapter 4.2, the test set is right censored, which makes the plots more illustrative. If we observe them from right to left, we can get a "step by step" intuition of what is happening with the pdf.
In the end, it can be seen that the Weibull Distribution behaves as expected. This opens up many possibilities to complement the expected value of the RUL. For instance, the variance (Equation 1.10) or the standard deviation are straightforward measures that we can get from the distribution. The Survival function is another interesting measure, since it computes the probability that the engine survives the predicted RUL. We also have the possibility to compute the Confidence Interval of the Weibull Distribution, which loosely speaking⁴ quantifies the level of confidence that the expected value of the RUL lies in the interval. We generated some illustrative gifs incorporating these ideas; a small sketch of these uncertainty measures follows the footnote.
4 More strictly speaking, the confidence level represents the frequency, i.e., the proportion, of possible confidence intervals that contain the true value of the unknown population parameter (in this case, the expected value of the RUL).
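A minimal sketch of these measures from the closed forms of chapter 1.2 (the survival function from Equation 1.7 and a central interval from the quantile of Equation 1.8); the 90% level is an arbitrary choice for illustration:

    import numpy as np

    def weibull_survival(t, alpha, beta):
        """S(t) = exp(-(t / beta)^alpha): probability of surviving beyond t."""
        return np.exp(-np.power(t / beta, alpha))

    def weibull_interval(alpha, beta, level=0.90):
        """Central interval [Q((1-level)/2), Q((1+level)/2)] via Equation 1.8."""
        q = lambda p: beta * (-np.log(1.0 - p)) ** (1.0 / alpha)
        lo, hi = (1.0 - level) / 2.0, (1.0 + level) / 2.0
        return q(lo), q(hi)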
5.3.2 GRU variant
In this case the model was trained locally with a 1.8 GHz Intel Core i5 CPU and 8 GB of 1600 MHz DDR3 RAM. Each epoch took around 4 seconds, which is approximately 3 times faster than the LSTM baseline. Having said that, gradients were exploding within the first 50 epochs. It was necessary to incorporate another tool from the wtte package, a scale factor for both the shape and scale parameters of the predicted Weibull Distribution. It was set to 0.25 in this experiment and it made a significant difference, since the model was able to run for 386 epochs without being terminated by exploding gradients.
An Early Stopping patience of 30 epochs was also used in this case. If we take a look at table 5.3 we can see that the model fits the train set better than the validation set, and this turns out to improve the scores on the test set. In conclusion, the GRU variant is much faster and obtains good results, but it is less stable than the LSTM version and it needs more attention.
Chapter 6
Discussion
6.1 Future Work
• Train on test data: Since the log-likelihood loss supports censored data, it would be interesting to train a model with the test data from the CMAPSS data set (chapter 4.1). All of the sequences are "interrupted" in the test set, which makes it harder for the model. We did a couple of trials but we didn't manage to obtain successful results.
• Measurement Noise: Sensors degrade with time, up to the point that they can produce aberrant or missing values. But during that process usually two things change: the amplitude of the signal gets wider, and the slope decreases. It would be interesting to experiment with new and old sensors, to observe the effect on the variance of the Weibull Distribution.
• Weak learners: A little out of the scope of the thesis, but an interesting idea that Martinsson shared with me consists of training many weak RNNs instead of a single stacked RNN, as a sort of boosting technique.
• Different distributions: Why the Weibull? Perhaps there are other alternatives
that work better for particular cases. Maybe the beta distribution would be
interesting to try.
6.2 Conclusions
We have focused on the many-to-many implementation of the batch mode (chapter
5.1), and we have successfully implemented a baseline model to predict the Remain-
ing Useful Life (RUL) using LSTM Networks (chapter 1.3.1). We have managed to
adapt this baseline to the WTTE-RNN framework (chapter 3.4) proposed by Egil
Martinsson, and we have experimented with the properties of the Weibull Distribu-
tion (chapter 1.2). Additionally, we have implemented a variant of the WTTE model
using a GRU network (chapter 1.3.2). As a result, we have evaluated (chapter 4.2)
three different models:
FIGURE 6.1: Violin plot of the error (ỹ − y) for the three models
In terms of the validation loss, the baseline model obtains better results. On the other side, it is the worst on the test set. Since the validation set was generated from the original training set, and it is used as the stopping criterion for the training, it looks like overfitting the training set benefits the validation score but penalizes the test evaluation. In any case, we have seen that the WTTE-RNN model is just as good as a regular regressor, but has many interesting attributes that are relevant in the Predictive Maintenance context.
Bibliography
Reid, Nancy (1994). “A Conversation with Sir David Cox”. In: Statistical Science.
Vol. 9.
Saxena, A. et al. (2008). “Damage propagation modeling for aircraft engine run-
to-failure simulation”. In: 2008 International Conference on Prognostics and Health
Management, pp. 1–9. DOI: 10.1109/PHM.2008.4711414.
Saxena, Abhinav and Kai Goebel (2008). “Turbofan Engine Degradation Simulation Data Set”. In: URL: http://ti.arc.nasa.gov/project/prognostic-data-repository.
Srivastava, Nitish et al. (2014). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: Journal of Machine Learning Research 15, pp. 1929–1958.
XIA, Guo-en and Wei-dong JIN (2008). “Model of Customer Churn Prediction on Support Vector Machine”. In: Systems Engineering — Theory & Practice. Vol. 28, pp. 71–77.
Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals (2014). “Recurrent Neural Net-
work Regularization”. In: CoRR abs/1409.2329.