Email Classification
Abstract
Context. Classifying emails into distinct labels can have a great impact on
customer support. By using machine learning to label emails, the system can
set up queues containing emails of a specific category. This enables support
personnel to handle requests more quickly and easily by selecting a queue
that matches their expertise.
Objectives. This study aims to improve the manually defined rule-based
algorithm, currently implemented at a large telecom company, by using
machine learning. The proposed model should have a higher F1-score and
classification rate. Integrating or migrating from a manually defined rule-based
model to a machine learning model should also reduce the administrative
and maintenance work, and make the model more flexible.
Methods. Using the frameworks TensorFlow, Scikit-learn and Gensim,
the authors conduct five experiments to test the performance of several
common machine learning algorithms, text representations and word embeddings,
and how they work together.
Results. In this article a web-based interface was implemented which can
classify emails into 33 different labels with an F1-score of 0.91 using a Long
Short Term Memory network.
Conclusions. The authors conclude that Long Short Term Memory networks
outperform non-sequential models such as Support Vector Machines and
AdaBoost when predicting labels for emails.
Contents
Abstract i
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
1 Introduction 1
1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Research gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Theoretical framework 6
2.1 Natural language processing . . . . . . . . . . . . . . . . . . . . . 6
2.2 Text representation . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Bag of words . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 NLP evaluation . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Machine learning classifiers . . . . . . . . . . . . . . . . . . 12
2.3.2 Deep learning classifiers . . . . . . . . . . . . . . . . . . . 13
3 Method 16
3.1 Hardware and software . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Email dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Data collection for word corpus . . . . . . . . . . . . . . . . . . . 18
3.4 Word representation . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.7 Evaluation procedures . . . . . . . . . . . . . . . . . . . . . . . . 22
3.7.1 Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.7.2 Friedman test . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7.3 Nemenyi test . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.8 Experiment design . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.8.1 Non-sequential classifier experiments . . . . . . . . . . . . 24
3.8.2 Sequential classifier experiments . . . . . . . . . . . . . . . 24
3.8.3 Experiment 1, NLP semantic & syntactic analysis . . . . . 25
3.8.4 Experiment 2, NLP evaluated in classification task . . . . 25
3.8.5 Experiment 3, LSTM network size . . . . . . . . . . . . . . 26
3.8.6 Experiment 4, NLP corpus & LSTM classifier . . . . . . . 26
3.8.7 Experiment 5, non-sequential models performance . . . . . 27
3.8.8 Experiment 6, Training time . . . . . . . . . . . . . . . . . 27
5 Discussion 40
6 Implementation of models 44
7 Conclusion 47
8 Future work 48
9 References 49
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
List of Figures
List of Tables
4.1 Performance metrics for each word vector algorithm used in LSTM
classification model . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Comparing the same LSTM network trained on different corpora
and 8 queue labels . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Comparing the LSTM network trained on different corpora and 33
email labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Jaccard index on queue labels with non-sequential algorithms and
ranking of pre-processing performance . . . . . . . . . . . . . . . 31
4.5 F1 -score on queue labels with Non-sequential algorithms . . . . . 31
4.6 Nemenyi post-hoc test on Jaccard index based on table 4.4 . . . . 32
4.7 Jaccard index with regards to non-sequential algorithms on queue
labels and ranking of classifier performance . . . . . . . . . . . . . 32
4.8 F1 -score with regards to non-sequential algorithms on queue labels
and ranking of classifier performance . . . . . . . . . . . . . . . . 33
4.9 Nemenyi post-hoc test on non-sequential algorithms Jaccard index
based on table 4.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.10 Nemenyi post-hoc test on non-sequential algorithms F1 -score based
on table 4.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.11 Jaccard index on email labels with non-sequential algorithms . . . 35
4.12 F1 -score on email labels with non-sequential algorithms . . . . . . 35
4.13 Jaccard index with regards to non-sequential algorithms on email
labels and ranking of classifier performance . . . . . . . . . . . . . 36
4.14 F1 -score with regards to non-sequential algorithms on email labels 36
4.15 Nemenyi post-hoc test on non-sequential algorithms performance
with email labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.16 Execution time in seconds when trained on 10000 emails . . . . . 38
List of Algorithms
Acronyms
ANN Artificial Neural Network. 12, 13, 27, 31–36, 38, 41, 47
AvgWV Average Word Vector. 5, 10, 27, 31–33, 35, 36, 41, 47
FN False Negative. 21
FP False Positive. 21
GloVe Global Vectors. 3, 9, 10, 19, 25, 28, 29, 31, 41, 46
LSTM Long Short Term Memory. iv, v, 1–5, 11, 12, 14, 23–27, 29–31, 35, 38–48
MVC Model-View-Controller. ix
NLP Natural Language Processing. 1–4, 6, 7, 19, 23, 25, 26, 29, 41, 43
TF Term Frequency. 8
TN True Negative. 21
TP True Positive. 21
Glossary
classification rate The frequency of emails that are classified with a valid label.
4
corpus A large set of structured text documents, normally one document per
row. v, 3, 4, 9, 18, 23, 25, 26, 30, 31, 43
document Any source which contains text segments related to a shared topic,
e.g. an email, blog post, Wikipedia page etc. 7, 8, 10, 11, 21
Gensim Gensim is an open source software library focusing on vector space and
topic modelling. 16
one-hot-vector A vector with all elements set to zero except for one. 7
queue labels Aggregated version of the 33 email labels that consist of 8 different
classes. 17
Scikit-learn Scikit-learn is an open source software library for simple and efficient
data mining and data analysis. 16, 27
sequential model A machine learning model that takes variable length time
series input. 5
support ticket A support request made by a customer to the company results
in a ticket which is queued and processed by the company support team. An
Issue Tracking System (ITS) is commonly used to manage support tickets. 1
Chapter 1
Introduction
with an LSTM classifier to tag emails based on their contents. The
tagged emails are then sent to the correct email queue where they are processed
by the specialised support personnel.
This report is structured as follows. First, the theoretical framework is presented
in chapter 2, on which the experiments, discussions and conclusions are based.
Chapter 3 describes the method and how each experiment was conducted.
The results are presented in chapter 4, followed by the discussion and conclusion,
which are presented in chapters 5 and 7 respectively.
model [8]. The word embeddings are supposed to model the language, but finding
a large enough corpus that represents the domain in which they are used is difficult.
Word vectors trained on huge corpora, such as the Google News corpus of about
100 billion words, are available to the public, but only for English. Fallgren, Segeblad
and Kuhlmann have evaluated the three most used word2vec models, BoW, skipgram
and Global Vectors (GloVe), on a Swedish corpus. They evaluate their word vectors
on the Swedish Association Lexicon and show that Continuous Bag-of-Words (CBoW)
performs best with a dimension of 300 and 40 iterations [9].
Nowak et al. show that LSTM and bi-directional LSTM perform significantly
better when detecting spam and classifying Amazon book reviews compared to
the non-sequential approach with Adaptive Boosting (ADA) and BoW [10].
Yan et al. describe a method of multi-label document classification using
word2vec together with LSTM and Connectionist Temporal Classification (CTC).
Their model is evaluated on different datasets, including emails, and produces
promising results compared to both sequential deep learning models such as RNN
and non-sequential algorithms such as Support Vector Machines (SVM). Their
research tries to solve the problems of multi-label classification by first representing
the document with an LSTM network, then training another LSTM network to
represent the ranked label stream. Finally they apply CTC to predict multiple
labels [11].
Gabrilovich and Markovitch compare SVM with C4.5, a Decision Tree (DT)
algorithm, on text categorisation. The C4.5 algorithm outperforms SVM by a
large margin on datasets with many redundant features. They show that SVM
can achieve better results than the C4.5 algorithm if the redundant features are
removed using aggressive feature selection [12].
Measuring the energy consumption of an algorithm is hard; measuring the
execution time, however, is trivial and still useful. Bayer and Nebel conclude that
the fastest algorithm is not always the one that consumes the least amount of
energy, but in most cases a shorter execution time does mean lower energy
consumption [13].
5. To which degree does the LSTM network size and depth affect
the classifier performance?
Motivation: Knowing how to tune an LSTM is not easy. Answering this
question will help researchers and machine learning experts tune their networks
better if trained in a similar environment.
1.4 Objectives
The objective of this work is to increase the performance of the automated email
classification and sorting process at a large telecommunication company that
supports this thesis. In order to improve on the current model, our proposed model
should have a greater or equivalent F1-score and a better classification rate. The
possibility to integrate the model in the current production environment should
also be delivered.
The telecommunication company will supply all the emails and the infrastruc-
ture required to evaluate the classifiers. The proposed framework will be evaluated
by domain experts in a test environment where it is compared against the current
rule-based classifier. If the suggested model is performing better than the current
classifier and if it adds business value to said company it will be integrated into
the production chain.
The research questions will enable researchers to continue to make progress
on Swedish email classification. The framework, although owned by the telecom-
munication company, will serve as a basis for further experiments on multi label
emails and semantic analysis.
1.5 Delimitations
The emails could be structured in chains and contain several emails that are sent
back and forth between the customer and the support personnel. The models
could theoretically use this to determine the context and how the conversations
change subject. The emails our model uses do not contain these chains; they are
separated and classified individually. The model is restricted to emails, and no
classification or training is performed on any other document types.
The non-sequential models are used as a baseline for comparison with the
LSTM network. The parameters of these models are not optimised and no
preprocessing techniques such as lemmatisation, stemming or stop word removal
are used. The BoW and Average Word Vector (AvgWV) text representation models
used for the non-sequential models are also not optimised. The BoW hyperparameters
filter out words whose document frequency falls outside a relative range, i.e. sub-
and supersampling. This is done to make the experiments feasible with all
non-sequential models.
The corpus is built to contain a recommended amount of words, i.e. enough
to train general word vectors. Specialised word vectors can be trained with fewer
words, but for general vectors, the more words in the corpus the better [15]. The
content of the corpus is therefore not reviewed or optimised, because of its relatively
large size of just over one billion words.
Due to the vast amount of computational power required to run the algorithms,
extensive hyperparameter testing is not practical. Instead a set of common
hyperparameters was used during the experiments and no further optimisation of
hyperparameters was performed.
Chapter 2
Theoretical framework
The theoretical framework covers most of the theory behind the models and
algorithms used in this paper. The models and algorithms are explained as well
as the underlying theory that defines them.
projecting data in an n-dimensional space and then separating that space into
subspaces which are labelled with a specific class. The first step, projecting the
data, varies greatly depending on the type of data. A general requirement, however,
is that the projection has to be of fixed output length, i.e. if you
want to project a document you have to make sure that the result is of the same
dimension regardless of the document length.
In order for a text document to be projected into an n-dimensional space we
need to consider the fact that documents contain texts of variable length. The
texts themselves also consist of words of variable length. In order to manage the
words it is common to build a dictionary of fixed length. The words can then be
represented as one-hot-vectors. Depending on the NLP model these vectors are
managed differently. There are three common categories of NLP models when it
comes to text processing: count based, prediction based and sequential. Count based
methods are based on the word frequencies, with the assumption that common
words in a document have significant meaning for the class. Prediction based
methods model the probabilistic relations between words, e.g. in the context of
“The cat drank ...” the words milk and water are more likely than car and shoe.
Sequential models are based on the assumption that a sequence, or stream, of
words is significant to the document's semantic meaning. Sequential models are
often combined with prediction based models to better capture the linear relations
together with the sequential order of the words.
2.2.1 Preprocessing
In the preprocessing step the documents are transformed from the raw document
to a structured document that is intended to contain as much information as
possible, without discrepancies that can affect the prediction result. A common
method to increase the information density of a document is to remove the words
that are very common and rarely have any significance, often referred to as stop
words. These are words such as “the”, “are”, “of”, which are insignificant in a larger
context. In BoW these are a list of predetermined words, but word2vec takes a
probabilistic approach, called subsampling, which avoids overfitting on the most
frequent words. For each instance of a word in word2vec, a probability of removal
is decided by equation 2.1, where t, usually set to 10⁻⁵, is the threshold and f(w_i)
is the word frequency [14].
P(w_i) = 1 − √(t / f(w_i))    (2.1)
In BoW these words are discarded from becoming features, and in word2vec they
are sometimes skipped, i.e. not included in the context window used for training.
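As a concrete illustration, the subsampling probability in equation 2.1 can be computed directly; the following is a minimal Python sketch (the example frequencies are only illustrative):

import math

def discard_probability(word_frequency, t=1e-5):
    # Probability of removing one occurrence of a word, per equation 2.1.
    return 1.0 - math.sqrt(t / word_frequency)

# A very frequent word (1% of all tokens) is dropped most of the time,
# while a word at the threshold frequency is never dropped.
print(discard_probability(0.01))   # ~0.97
print(discard_probability(1e-5))   # 0.0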
In a corpus of millions of words there will be some outliers, e.g. random
sequences of numbers, noise, misspellings etc. As these words are very uncommon
and often do not appear more than a couple of times, it is common to enforce a
minimum count before adding words to the dictionary.
In BoW the documents dA = “the power was the missing factor” and dB =
“sadly we are missing the wisdom” would be modelled by the features and the final
vector representation seen in table 2.1.
To capture the context of words in a BoW model it is common to combine
the terms in a document in a model called bag of n-grams. These n-grams are
combinations of tokens found in the documents. A Bag of Words Bi-gram (BoWBi)
model based on the document dA mentioned previously would have the following
features:
the power, power was, was the, the missing, missing factor
The Inverse Document Frequency (IDF) weighting scheme is introduced to solve
the problem of equally weighted terms. The document frequency df_t is defined as
the number of documents that contain a term t. If a term t has a low frequency
and appears in a document d, then we would like to give the term a higher weight,
i.e. increase the importance of the term t in the document. The IDF weight
is therefore defined as shown in equation 2.2, where N is the total number of
documents [16].
idf_t = log(N / df_t)    (2.2)
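A minimal scikit-learn sketch of these representations, using the running example documents d_A and d_B; note that scikit-learn's TfidfVectorizer applies a smoothed variant of the IDF weight in equation 2.2:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the power was the missing factor",   # d_A
        "sadly we are missing the wisdom"]     # d_B

bow = CountVectorizer()                        # plain bag of words (term counts)
print(bow.fit_transform(docs).toarray())
print(sorted(bow.vocabulary_))                 # the learned dictionary

bow_bi = CountVectorizer(ngram_range=(2, 2))   # bag of bi-grams (BoWBi)
bow_bi.fit(docs)
print(sorted(bow_bi.vocabulary_))              # "the power", "power was", ...

tfidf = TfidfVectorizer()                      # counts re-weighted by IDF
print(tfidf.fit_transform(docs).toarray().round(2))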
2.2.3 Word2Vec
The Word2Vec model is based on the assumption that words with similar semantics
appear in the same context. This can be modelled by placing a word in a high
dimensional vector space and then moving words closer based on their probabilities
to appear in the same context. There are mainly three different methods to
calculate these vectors, CBoW [3], skipgram [14], and GloVe [15]. A relatively
large corpus is required for these models to converge and achieve good results
with word vectors, normally around one billion words or more.
CBoW
The CBoW method is based on the principle of predicting a centre word given
a specific context. The context is in this case the n previous and n future words
around the centre word, where n is determined by the window size. The structure
of CBoW is somewhat similar to auto-encoders: the model is based on a neural
network structure with a projection layer that encodes the probabilities of a word
given the context. The goal is to maximise the log probabilities, which makes
CBoW a predictive model. The projection layer and its weights are what later
become the word vectors. However, in order to feed the network with words you
first have to encode the words into one-hot-vectors, which are defined by a dictionary.
This dictionary can contain over a million words, while the projection layer typically
ranges anywhere between 50 and 1000 nodes [3, 18].
Skipgram
The skipgram model is similar to the CBoW model, but instead of predicting the
centre word given the context, skipgram predicts the context given the centre
word. This allows the skipgram model to generate a lot more training data, which
makes it more suitable for small datasets; however, it is also several orders of
magnitude slower than CBoW [14].
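Both objectives are available through Gensim, which is used in this thesis; a minimal sketch is shown below with settings similar to the word2vec parameters used in the method chapter (vector size 600, window size 10). The toy corpus, the lowered minimum count and the parameter names, which follow recent Gensim versions, are only illustrative:

from gensim.models import Word2Vec

# `sentences` is an iterable of tokenised documents; a corpus of roughly a
# billion words is recommended for general-purpose vectors.
sentences = [["hej", "jag", "har", "problem", "med", "min", "faktura"]]  # toy placeholder

common = dict(vector_size=600, window=10, min_count=1, epochs=10)
cbow = Word2Vec(sentences, sg=0, **common)       # CBoW: predict centre word from context
skipgram = Word2Vec(sentences, sg=1, **common)   # skipgram: predict context from centre word

print(cbow.wv["faktura"][:5])             # first values of the 600-dimensional vector
print(skipgram.wv.most_similar("faktura"))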
Skipgram n-gram
The skipgram n-gram model is based on skipgram, but instead of using a dictionary
of complete words it uses variable-length n-grams. Other models rely on the
dictionary to build and query vectors; if a word is not in the dictionary the model
is unable to create a vector for it. The skipgram n-gram model can construct
word vectors for any word based on the n-grams that make up the word. The
model has slightly lower overall accuracy, but with the benefit of not being limited
to the dictionary.
GloVe
The GloVe model does not use neural networks to model the word probabilities
but instead relies on word co-occurrence matrices. These matrices are built
from the global co-occurrence counts between two words. GloVe then performs
dimensionality reduction on said matrix in order to produce the word vectors. Let
X be the co-occurrence matrix, where X_ij is the number of times word j occurs in
the context of word i. Let X_i = Σ_k X_ik be the number of times any word appears
in the context of i. The probability that word j appears in the context of i can
now be calculated as follows:

P_ij = P(j|i) = X_ij / X_i    (2.3)
This makes GloVe a hybrid method as it models probabilities based on frequencies
[15].
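A small NumPy sketch of equation 2.3, computing the co-occurrence probabilities from a toy co-occurrence matrix (the counts are made up purely for illustration):

import numpy as np

# X[i, j] = how often word j appears in the context of word i over the corpus.
X = np.array([[0, 4, 1],
              [4, 0, 2],
              [1, 2, 0]], dtype=float)

X_i = X.sum(axis=1, keepdims=True)   # total context counts per word
P = X / X_i                          # P[i, j] = P(j | i), equation 2.3
print(P.round(3))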
2.3 Classification
Single-label text categorisation (classification) is defined as the task of assigning a
category to a document given a predefined set of categories [20]. The objective is
to approximate the document representation such that it coincides with the actual
category of the document. If a document can belong to several categories we need
to adapt our algorithm to output multiple categories, which is called multilabel
classification. The task is then to assign an appropriate number of labels that
correspond with the actual labels of the document [20].
A fundamental goal of classification is to place documents that share a context
in the same set, and documents that do not share a context in separate sets. This
can be done with different approaches that involve machine learning algorithms.
Machine learning algorithms learn to generalise categories from previously seen
documents, which is later used to predict the category of previously unseen
documents. Peter Flach explains in his book, “Machine learning: the art and
science of algorithms that make sense of data”, three different groups of machine
learning algorithms, namely geometric, probabilistic and logic based models [21].
The different groups of classifiers achieve the same goal, but their methods are
different. These classifiers are hereafter referred to as non-sequential classifiers
since they do not handle the words in the emails as a sequence. A sequential
classifier, such as LSTM, handles each word in the email sequentially, which allows
it to capture relations between words better and possibly utilise the content of the
email better than a non-sequential classifier.
Geometric classifiers split the geometric space into different parts where each
subspace belongs to a class [21]. The boundaries are optimised to reduce the
number of wrongly classified training instances. Probabilistic classifiers model
probabilistic relationships between the features and the classes [21]. They calculate
the probability P(Y|X) where X is known and Y is the class. Rule based models
build a tree where each node has several child nodes based on the constructed
rules [21]. The tree is built iteratively, where each child partitions the instance
space. A leaf is then assigned a class, value, probability or whatever output is
preferred.
Decision tree
A DT classifier is modelled as a tree where rules are learned from the data in an
if-else form. Each rule is a node in the tree and each leaf is a class that will be
assigned to the instances that fulfil the conditions of all nodes above it. For each
leaf a decision chain can be created that is often easy to interpret. The interpretability
is one of the strengths of the DT, since it increases the understanding of why the
classifier made a decision, which can be difficult to achieve with other classifiers.
The telecommunication company is today using a manually created decision tree,
in which the rules are based on different combinations of words.
Naive bayes
NB is a probabilistic classifier built upon the famous Bayes' theorem,

P(A|B) = P(B|A) × P(A) / P(B),

where A is the class and B is the feature vector [23, 24, 25].
The probabilities P(B|A), P(A) and P(B) are estimated from previously known
instances, i.e. training data [23, 25]. The classification errors are minimised by
selecting the class that maximises the probability P(A|B) for every instance [25].
The NB classifier is considered to perform optimally when the features are
independent of each other, and close to optimally when the features are slightly
dependent [23]. Real world data often does not meet this criterion, but researchers
have shown that NB performs better than or similar to C4.5, a decision tree algorithm,
in some settings [23]. The researchers argue that NB performs well even when
there is a clear dependency between the features, making it applicable in a wide
range of tasks.
AdaBoost
ADA is built upon the premise that multiple weak learners that perform somewhat
well can be combined, using boosting, to achieve better results [26]. The algorithm
performs two important steps when training and combining the weak classifiers:
first it decides which training instances each weak classifier should be trained on,
and then it decides the weight each classifier should have in the vote.
Each weak classifier is given a subset of the training data, where each
instance in the training data is given a probability that is decided by the previous
weak classifiers' performance on that instance. If the previous weak classifiers
have failed to classify the instance correctly, it has a higher probability of being
included in the following training data set.
The weight used in the voting is decided by each classifier's ability to correctly
classify instances. A weak classifier that performs well is given more influence than
a classifier that performs badly.
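A minimal scikit-learn sketch of how the non-sequential classifiers discussed above can be trained on bag-of-words features; the example emails and label names are hypothetical and only meant to show the wiring:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

emails = ["jag har problem med min faktura",
          "fakturan verkar vara fel",
          "min router startar inte",
          "bredbandet ligger nere"]
labels = ["Invoice", "Invoice", "Hardware", "Broadband"]

for clf in (DecisionTreeClassifier(), MultinomialNB(),
            AdaBoostClassifier(), LinearSVC()):
    model = make_pipeline(CountVectorizer(), clf)   # BoW features + classifier
    model.fit(emails, labels)
    print(type(clf).__name__, model.predict(["fel på min faktura"]))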
a(W × X + b) (2.6)
The weights and biases in an ANN have to be tweaked in order to produce the expected
outcome. This is done when training the network, which normally is done with
backpropagation. The backpropagation algorithm is based on calculating the
gradients given a loss function and then editing the weights accordingly, given an
optimisation function. Normally an ANN is designed with an input layer matching
the size of the input data, a number of hidden layers and finally an output layer
matching the size of the output data.
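The layer computation in equation 2.6 amounts to a matrix product followed by a non-linear activation; a tiny NumPy sketch (the shapes and the ReLU activation are chosen only for illustration):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)   # a(...), the activation function

rng = np.random.default_rng(0)
X = rng.normal(size=4)          # input vector
W = rng.normal(size=(3, 4))     # layer weights
b = np.zeros(3)                 # layer biases

hidden = relu(W @ X + b)        # a(W × X + b), equation 2.6
print(hidden)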
y_t = W_hy × h_t + b_y    (2.8)
Training the RNN is normally done by estimating the next probable output in
the sequence and then altering the weights accordingly. However, consider a stream
of data for which a prediction is made at each time step; each prediction will be
based on the current input and all previous inputs. This makes it very hard to
accurately train the network, as the gradients will gradually vanish or explode the
longer the sequences are [2].
L(w) = (1/N) Σ_{n=1..N} H_{x_n}(y_n)    (2.10)
The Shannon entropy function, seen in figure 2.9, is the base upon which the cross
entropy loss is defined [30]. Equation 2.10 defines the cross entropy loss.
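A small NumPy sketch of the average cross entropy loss in equation 2.10, here with one-hot target vectors and predicted class probabilities (the numbers are made up):

import numpy as np

def cross_entropy(y_true, y_pred):
    # Average cross entropy over N examples, per equation 2.10.
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0, 0], [0, 1, 0]])               # one-hot labels
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # predicted probabilities
print(cross_entropy(y_true, y_pred))                    # ~0.29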
Overfitting
A desired trait in machine learning models is the ability to generalise over many
datasets. Generalisation in machine learning means that the model has low error
on examples it has not seen before [32]. There are two measures which are normally
used to indicate how well the model fits the data: bias and variance. The bias is a
measure of how much the model differs, over all possible datasets, from the desired
output. The variance is a measure of how much the model differs between datasets.
When the model first starts training the bias will be high, as it is far from
the desired output, but the variance will be low, as the data has had little
influence over the model. Late in the training the bias will be low, as the model
has learned the underlying function. However, if trained too long the model will
start to learn the noise in the data, which is referred to as overfitting. In the case
of overfitting the model will have low bias, as it fits the data well, and high variance,
as the model follows the data too closely and does not generalise across datasets [31].
Usually it is preferred to have a good balance between the bias and the variance.
There exist methods to avoid overfitting. Early stopping is one of them and
involves stopping the training of the model when some stopping criterion is met. The
criterion can be human interaction, a low change in loss or any other ad-hoc definition
[32]. Another method is dropout, which only trains a random set of neurons when
updating the weights. The idea is that when only a subset of the neurons is
updated at a time, they each learn to recognise different patterns, which
reduces the overall overfitting of the network [33].
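Both techniques are available in Keras; a minimal sketch, assuming 600-dimensional input features and 33 classes (the layer sizes and patience value are illustrative, not the thesis configuration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(600,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                  # silence half the units each update
    tf.keras.layers.Dense(33, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Early stopping: halt training when the validation loss stops improving.
stopper = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.1, epochs=200, callbacks=[stopper])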
Chapter 3
Method
As the authors had data available and a practical environment in which to work and
test the models, conducting experiments was deemed a suitable method to
answer the research questions. This chapter describes the data and its design and
properties. It also covers how data collection, pre-processing, word representation
and experiment evaluation were performed. Section 3.8 explains the aim of each
experiment and why it is performed.
The email labels can be aggregated into queue labels which is an abstraction of
the 33 labels into 8 queue labels. The merger is performed by fusing emails from
the same email queue, which is a construction used by the telecommunication
company, into a single queue label. The labels that are fused together are often
closely related to each other, which effectively will reduce the amount of conflicts
between the email labels and their contents. If an email contains two or more labels
it is disregarded, since it might introduce conflicting data which is unwanted when
training the classifier. Without “DoNoUnderstand” and the multilabel emails
there are a total of 58,934 emails in the dataset.
Each email contains a subject and body, which is valuable information for the
classifier. The emails may also contain Hypertext Markup Language (HTML) tags
and metadata which are artefacts from the infrastructure. The length of each
email varies; the average is 62 characters. Figure 3.2 shows the length
distribution, where emails under 100 characters are the most common.
corpus on the targeted domain, in our case it is the support emails, and then fill
the corpus with data from other sources to make it more extensive.
Parameter Value
Vector size 600
Window size 10
Minimum word occurrences 5
Iterations 10
Parameter Value
Minimum document frequency 0.001
Maximum document frequency 0.01
Metric Definition
TP Label is present, label is predicted
TN Label is not present, label is not predicted
FP Label is not present, label is predicted
FN Label is present, label is not predicted
The metrics are defined per label
Table 3.3: Positives and negatives definition
There exist several metrics that utilise the coincidence matrix. However, there
are pitfalls that must be considered when using them. Accuracy is defined
as the true predictions divided by the total, shown in equation 3.1. In a multiclass
problem, in our case email labelling with 33 classes, the average probability
that a document belongs to a single class is 1/33 ≈ 0.0303, i.e. 3.03%. A dumb
algorithm that rejects all documents as belonging to any class would have an error
rate of 3% and an accuracy of 97% [41]. To gain better insight we also measure
the Jaccard index, seen in equation 3.2. The Jaccard index disregards the TN and
only focuses on the TP, which makes the results easier to interpret. Equation 3.3,
precision, measures how many TP there are among the predicted labels, and equation
3.4, recall, measures how many labels are correctly selected among all labels.
A classifier that predicts all labels would have a low precision, since it
would have many FP, but the recall would be high because there would not be
any FN. The F1-score is the harmonic mean between precision and recall [21]; a
good score is only achieved if there is a balance between precision and recall.
The F1-score makes an implicit assumption that the TN are unimportant in the
operative context, which they are in this context.
                            True class
                            Positive                Negative
Predicted class  Positive   True Positive (TP)      False Positive (FP)
Predicted class  Negative   False Negative (FN)     True Negative (TN)
Olson and Delen define the following metrics for evaluating predictive models
[40], as described in equations 3.1, 3.2, 3.3, 3.4 and 3.5 [21]. These measurements are
used to give insight into the classifiers' performance on previously unseen emails.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.1)

Jaccard index = TP / (TP + FP + FN)    (3.2)

Precision = TP / (TP + FP)    (3.3)

Recall = TP / (TP + FN)    (3.4)

F1-score = 2TP / (2TP + FP + FN)    (3.5)
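These metrics are all available in scikit-learn; a short sketch with hypothetical labels (macro averaging is only one of several possible choices for the multiclass case):

from sklearn.metrics import (accuracy_score, f1_score, jaccard_score,
                             precision_score, recall_score)

y_true = ["Invoice", "Hardware", "Invoice", "Broadband"]
y_pred = ["Invoice", "Invoice", "Invoice", "Broadband"]

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
print(jaccard_score(y_true, y_pred, average="macro"))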
3.7.1 Ranking
Ranking is used to obtain a rapid understanding of the result and its significance [42].
The rank is assigned in order of magnitude, where a higher rank is better.
Ranking can be performed if there are two or more results that can be compared
against each other.
with 10-times 10-fold cross validation. Instead we use a test set that consists of
a random 10% sample of the data and train the model on the remaining 90%.
The sets are randomly chosen from a uniform distribution without any class
balancing. These experiments measure the accuracy, precision, recall, F1-score
and the Jaccard index. The sequential models are not evaluated using statistical
tests because there is too little data for the tests to work properly.
The non-sequential models are tested with 10-times 10-fold cross validation.
These models are measured by the F1-score and the Jaccard index, since they
measure the performance from different perspectives, both of which are valuable.
The Friedman test and the Nemenyi post-hoc test will be performed on both of the
measurements to show if there is any significant difference in the results.
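The Friedman test is available in SciPy; a minimal sketch over per-fold Jaccard scores for three text representations (all numbers are invented for illustration; the Nemenyi post-hoc test is not part of SciPy and needs a separate package):

from scipy.stats import friedmanchisquare

# One Jaccard index per fold (rows) for each text representation (columns).
bow    = [0.61, 0.63, 0.60, 0.62, 0.64]
bow_bi = [0.65, 0.66, 0.64, 0.65, 0.67]
avg_wv = [0.59, 0.70, 0.58, 0.71, 0.60]

statistic, p_value = friedmanchisquare(bow, bow_bi, avg_wv)
print(statistic, p_value)   # reject equal performance if p < 0.05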
Parameter Value
Word limit (sequence length) 100
Hidden layers 128
Depth layers 2
Batch size 128
Learning rate 0.1
Maximum epochs 200
Dropout 0.5
Forget bias 1.0
Use peepholes False
Early stopping True
Cells Layers
128 1
512 1
1024 1
128 2
512 2
1024 2
⁵ The default values are described in the documentation: http://scikit-learn.org/stable/modules/classes.html
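For reference, a Keras sketch of an LSTM classifier wired up with the hyperparameters listed above (two layers of 128 cells, dropout 0.5, batch size 128, at most 200 epochs); the vocabulary size is an assumption, and in the thesis setup the embedding layer would be initialised from the pre-trained GloVe vectors rather than trained from scratch as here:

import tensorflow as tf

VOCAB_SIZE = 100_000    # assumed dictionary size
EMBED_DIM = 600         # word vector dimension
NUM_CLASSES = 33

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),        # word index -> word vector
    tf.keras.layers.LSTM(128, return_sequences=True, dropout=0.5),
    tf.keras.layers.LSTM(128, dropout=0.5),                  # depth 2, 128 cells per layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss="categorical_crossentropy", metrics=["accuracy"])

# x_train: integer word indices padded/truncated to 100 tokens per email,
# y_train: one-hot email labels; early stopping can be added via a Keras callback.
# model.fit(x_train, y_train, batch_size=128, epochs=200, validation_split=0.1)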
Chapter 4
Result and analysis
Figure 4.1 and 4.2 show the total and per-category accuracy on the semantic
and syntactic questions. The different models performed similarly; however, CBoW
achieved the highest total accuracy of 66.7%. Skipgram-ng achieved the lowest
total accuracy, but with the added benefit of being able to construct vectors for
words not in the original dictionary. GloVe trained on the smaller corpus, based
solely on the emails, achieves a total accuracy of 2.1%, which is 64.6 percentage
points less than the best model.
Figure 4.2: Word vector semantic and syntactic accuracy per category.
Table 4.1: Performance metrics for each word vector algorithm used in LSTM
classification model
The results from table 4.1 show that the word vectors generated by GloVe
perform best. The results are very similar and drawing any conclusions is therefore
difficult. In the following experiments the word vectors trained by GloVe will be
used.
Table 4.2: Comparing the same LSTM network trained on different corpora and
8 queue labels
The results from table 4.2 show that the full corpus of Språkbanken, Wikipedia
and emails performs better than the corpus based only on the emails. The Jaccard
index is 6 percentage points better and the F1-score 3 percentage points better when
LSTM is trained on the big corpus. However, it is interesting that LSTM can achieve
good performance with word vectors based on a small corpus, even though it scored
terribly in the semantic and syntactic analysis, as seen in figures 4.1 and 4.2.
Table 4.3: Comparing the LSTM network trained on different corpora and 33
email labels
Table 4.3 shows the results when LSTM is trained on two different GloVe word
vectors from different corpora. Training LSTM on the big corpus increases the
Jaccard index by 6 percentage points and the F1-score by 4 percentage points. The
relative performance is about the same as the results from section 4.4.1 trained on
queues. The decrease in F1-score may suggest that the corpus based on the emails
struggles when the number of classes grows.
The results from this experiment and the experiment in section 4.4.1 answer
research question 4 regarding the corpus effect.
Table 4.4 and table 4.5 show the performance of the different pre-processing
algorithms. From table 4.4, the average rank suggests that BoWBi performs best
when compared to BoW and AvgWV. Even though BoWBi seems to perform better
on average, there are two outliers in which AvgWV performs about 10 percentage
points better, which is also the best result obtained.
The Friedman test confirms that there is a significant difference in the performance
when measuring the Jaccard index at a significance level of 0.05, χ²(2) = 7.600,
p-value = 0.022. However, the test does not confirm a significant difference at
significance level 0.05 for the F1-score measurements, χ²(2) = 3.600, p-value = 0.166.
Tables 4.7 and 4.8 present the results with focus on the classification algorithms
and how they perform against each other. ANN and SVM obtain the best results
overall when trained on AvgWV. The average ranks in tables 4.7 and 4.8 show that
SVM performs best in all cases and that NB performs worst in all cases. ADA, ANN
and DT seem to perform equally, except for the good result obtained by ANN when
trained on AvgWV. The Friedman test on the Jaccard index results in table 4.7,
χ²(2) = 10.667, p-value = 0.031, rejects the null hypothesis that all classifiers
perform equally at a significance level of 0.05. The F1-score results show the same
pattern, χ²(2) = 10.237, p-value = 0.037, which rejects the null hypothesis at
significance level 0.05.
Table 4.9 and table 4.10 show that the only significant performance difference
is between SVM and NB. From the ranking in tables 4.7 and 4.8, SVM performs
best in all cases.
Figure 4.3: The plot shows the median in red, the first and third quartiles, and the
min/max values, with possible outliers shown as circles
The box plot in figure 4.3 shows the classification performance over 10 folds
using the combination of pre-processing algorithm and classification algorithm
that performed best together. The variance is low for all algorithms, which is
a good indication that the model does not overfit and can generalise well to
previously unseen emails.
Table 4.11 and table 4.12 show the results when the pre-processing algorithms
are applied to the email labels. The Friedman tests on the Jaccard index, χ²(2) = 3.600,
p-value = 0.165, and the F1-score, χ²(2) = 2.800, p-value = 0.247, do not show
a significant difference at a significance level of 0.05. SVM and ANN do, however,
perform about 10 percentage points better when trained on AvgWV compared to
the other pre-processing algorithms and other classification algorithms.
The performance decreases compared with the results based on queues, found
in tables 4.7 and 4.5, which is expected due to the increased difficulty of more
classes. The results may decrease because some of the classes may be closely
related to each other. Closely related labels may be hard to separate, which could
explain the drop in performance on email labels compared to the queue labels.
Table 4.13 and table 4.14 are the transformed versions of tables 4.11 and 4.12.
The Friedman test with focus on the classification algorithms shows a significant
difference on the Jaccard index, χ²(2) = 10.667, p-value = 0.031, at a significance
level of 0.05, but not on the F1-score, χ²(2) = 7.200, p-value = 0.126. From the
ranks in table 4.13 we can see that SVM performs best using all pre-processing
algorithms, whereas NB performs worst in all cases.
One significant difference was found, between SVM and NB, as seen in table
4.15, at a significance level of 0.05. There are, however, differences between the
other algorithms even though they are not considered significant.
Figure 4.4: The plot shows the median in red, the first and third quartiles, and the
min/max values, with possible outliers shown as circles
Figure 4.4 shows visually, through a box plot, how the performance differs
between the classifiers. The plot is drawn from the text representation that yields
the maximum accuracy per classifier. SVM has the highest average accuracy, with
low variance and a small difference between the lowest and highest values.
Table 4.16 presents the execution time of the algorithms trained on 10000 emails.
The wall time is measured from the start to the finish of the training. There is a
big difference in the training time: NB is the fastest to train, with less than
one second, while SVM is the slowest of the non-sequential algorithms with a training
time of 39 seconds. LSTM trains over several epochs, in which it trains on the
same samples several times to adjust its weights; this process is time consuming,
which is shown by the execution time. The training time of the LSTM network is
strongly correlated with how many epochs the network needs before convergence.
In this measurement the LSTM network needed 94 epochs to converge.
Figure 4.5: Certainty values per label using the proposed LSTM model based
on the test dataset, illustrated with a box plot.
Chapter 5
Discussion
questions were used. These questions were defined by the authors and considered
extensive enough. However, as the authors are not linguistic experts, there may
have been both discrepancies and faults in the dataset. Verifying the integrity
of the dataset and expanding the set with more questions is important if it
is to be used to evaluate word vector performance. Evaluating the word
vectors using QVEC, as proposed by Tsvetkov et al., may be a better evaluation
method and lead to a better understanding of the word vectors' performance in a
classification task [47].
In order to determine which word vector model to use, each model was
evaluated using a set of Swedish semantic and syntactic questions. The models
performed approximately the same, with the exception of GloVe which performed
overall about one percentage point better than the other models. However, when
the word vectors were used in training of the LSTM model there was little or no
difference in performance. The cause may be that the LSTM network is
able to learn the same patterns in the dataset even with differences between the
word vectors. When choosing the best word vector model for a classifier it is
therefore important to evaluate the models in a classification task, since the performance
on the semantic and syntactic questions did not correlate with the performance
of the word vectors in a classification task. The semantic and syntactic analysis
shows how well the word vectors model the language in general, which may not
be relevant for domain specific classification. The LSTM network is shown to be
able to adapt to word vectors that do not achieve good semantic and syntactic
results. It is possible that the word vectors based on the emails do model the
domain language, which may be what the LSTM network utilises. Incorporating
domain language in a corpus is therefore recommended, because it may add valuable
relations between words that have a different semantic and syntactic meaning in
the domain.
As extensive computation is used to solve problems that are unsolvable by
humans, we have to take into account the efficiency of the algorithm that is
used. The energy usage can differ severely between different algorithms depending
on several factors. The execution times in section 4.6 show that the LSTM
network has about 513 times longer execution time than SVM, which is the slowest of
the non-sequential algorithms. LSTM also executes on both the CPU and the
GPU, which none of the non-sequential algorithms do. Improving the LSTM
hyperparameters may lead to a reduced execution time. Techniques such as warm
restarts could also increase the convergence rate [45]. If energy consumption is a
concern and the extra performance given by LSTM is redundant, it is
recommended to use ANN with AvgWV.
The different classifiers are well suited for NLP tasks. LSTM performs
better than the other classifiers, but it requires more data. If NLP tasks
are to be solved in other domains that do not generate enough data for an LSTM
to work properly, it would be advisable to train an SVM using AvgWV. LSTM is
more adaptable, but knowing how to optimise the network requires domain
knowledge.
Classifying emails wrongly may affect the customer who sent the email. If a
company specialises its support personnel, they might receive emails that they are
not trained to answer. In those cases it is important to have a strict policy that
requires all personnel to forward the email to another colleague that can handle the
errand better. Wrongly classifying emails that contain sensitive information could
lead to information disclosure, if personnel that do not have the correct security
clearance receive the information.
The following list aims to explicitly answer our research questions previously
defined in section 1.3.
1. To which degree does the NLP model (e.g. word2vec) affect the
classifier's classification performance?
Result: The NLP model used does not significantly affect the classification
performance, as LSTM seems to compensate for the differences between the models.
5. To which degree does the LSTM network size and depth affect
the classifier performance?
Result: The network size and depth do not affect the classification perfor-
mance significantly, as long as a suitable size and depth is used.
Chapter 6
Implementation of models
labels.
Algorithm 3: API request handling
input : Email e to be classified and optional preferred model m
output : Class and its certainty if applicable

wvd ← load dictionary;
wvv ← load vectors;
lstm_mlb ← load LSTM multilabel binarizer;
svm_mlb ← load SVM multilabel binarizer;
svm ← load SVM model;
dropout ← 1;        // disable dropout for predictions
forget_bias ← 0;    // disable forget bias for predictions
restore the LSTM model by loading the latest TensorFlow checkpoint;
model ← get model from request, else LSTM;    // default model is LSTM
email ← get email from request;
clean the email from unknown characters;
look up the index of each word in the email using wvd;
look up the vector of each index using wvv;
if model ≡ lstm then
    predict the class using the LSTM model and lstm_mlb;
else if model ≡ svm then
    average the email to one single vector;
    predict the class using the SVM model and svm_mlb;
else
    return error, invalid model;
end
An example request could be made as follows:

curl -H "Content-Type: application/json" -X POST \
     -d '{"model": "lstm",
          "email": "hej, jag har problem med min faktura"}' \
     https://api.fqdn.com/api/classify

which will result in a response with the format

{
    "class": "Invoice",
    "prob": "0.9932"
}
The server classifies approximately 20 emails per second with LSTMs. The LSTM model
used by the API was trained using the hyperparameters described in table 6.1. The
model is trained on the full dataset of ≈ 60000 support emails from a live telecom
environment. The hardware and software packages are the same as described in section
3.1.
Parameter Value
Word vectors GloVe
Corpus Full (Språkbanken, Wikipedia, emails)
Word limit (sequence length) 100
Hidden layers 128
Depth layers 2
Batch size 128
Learning rate 0.1
Maximum epochs 200
Dropout 0.5
Forget bias 1.0
Use peepholes False
Early stopping True
Chapter 7
Conclusion
Of the six different classifiers that are evaluated, LSTM performs best on both the queue
labels and the 33 email labels. The LSTM network achieves almost as good results when
using the 33 labels as when using the queue labels. Aggregating the labels does therefore
increase the performance, but only nominally. Of the non-sequential classifiers, ANN and
SVM achieved the best results on both the queue and the 33 labels when trained on AvgWV.
The use of AvgWV improved the performance substantially compared to BoW and BoWBi if
used with a suitable classifier. The training time of LSTM is several factors longer than
that of the non-sequential models; if power consumption and training time are important,
select a non-sequential model such as SVM with AvgWV.
A framework was implemented based on the results of the experiments. The frame-
work is intended to generate business value for a company by reducing the work hours
spent on tuning rule-based systems. Changing to a machine learning based framework
also allows faster and easier development of features such as sentiment analysis,
which will add further business value to a company. LSTM is chosen as the main
classifier because of its classification performance and the features it supports, such as
the possibility to receive a probability value indicating the certainty of the prediction. The
probability can be of much use for a data analyst when improving the model, by knowing
its strengths and weaknesses.
Chapter 8
Future work
Extending the classification to identify emotions in the email can help the support team
deal with angry or dissatisfied customers [5]. Doing so will improve the customer service
since the support personnel can cope with the emotions of the customer. This will
increase the customer satisfaction and decrease the number of customers that change
provider.
Given that the model only classifies the latest response in an email conversation, which
often keeps the subject of the original email, there may be conflicts that cause confusion
for the LSTM network. There may be a performance increase in separating the subject
from the body and using two LSTM networks to classify each part separately. The two
networks may then be interlaced by a fully connected neural network.
Currently the emails are processed before entering the classifier. In the early stages
of preprocessing, all bodies other than the first are stripped. The other bodies contain
previous conversations and may be helpful during classification. However, the effect of
stripping the other bodies versus including two or more is unknown, and future work may
compare the effect of including several bodies during classification.
Currently the network is trained once and does not change its predictions in produc-
tion, even if they were to be wrong. If the network is to improve over time it has to
be periodically retrained. This procedure is both time and computationally costly. It
also introduces a delay between the correction and the actual adaptation of the model.
One way to reduce this time and allow the network to adapt continuously to changes in
the email environment is reinforcement learning. Future work may look closer at the
benefits and usefulness of reinforcement learning.
Chapter 9
References
[2] M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural networks for language mod-
eling,” in Thirteenth Annual Conference of the International Speech Communication
Association, 2012.
[5] R. Bougie, R. Pieters, and M. Zeelenberg, “Angry customers don’t come back, they
get back: The experience and behavioral implications of anger and dissatisfaction in
services,” Journal of the Academy of Marketing Science, vol. 31, no. 4, pp. 377–393,
2003.
[10] J. Nowak, A. Taspinar, and R. Scherer, LSTM Recurrent Neural Networks for Short
Text and Sentiment Classification. Cham: Springer International Publishing, 2017,
pp. 553–562.
[11] Y. Yan, Y. Wang, W.-C. Gao, B.-W. Zhang, C. Yang, and X.-C. Yin, “LSTM²:
Multi-label ranking for document classification,” Neural Processing Letters, May
2017.
[13] H. Bayer and M. Nebel, “Evaluating algorithms according to their energy consump-
tion,” Mathematical Theory and Computational Practice, p. 48, 2009.
[15] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word
representation,” in Proceedings of the 2014 conference on empirical methods in
natural language processing (EMNLP), 2014, pp. 1532–1543.
[17] N. S. Baron, “Letters by phone or speech by other means: the linguistics of email,”
Language & Communication, vol. 18, no. 2, pp. 133 – 170, 1998.
[18] C. McCormick, “Word2vec tutorial - the skip-gram model,” 2017. [Online]. Available:
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[21] P. Flach, Machine learning: the art and science of algorithms that make sense of
data. Cambridge University Press, 2012.
[22] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20,
no. 3, pp. 273–297, Sep 1995.
[23] P. Domingos and M. Pazzani, “Beyond independence: Conditions for the optimality
of the simple bayesian classifer,” in Proc. 13th Intl. Conf. Machine Learning, 1996,
pp. 105–112.
[24] H. Zhang, “The optimality of naive bayes,” Association for the Advancement of
Artificial Intelligence, vol. 1, no. 2, p. 3, 2004.
[27] A. Graves, A. r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent
neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing, May 2013, pp. 6645–6649.
[31] L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks of the
Trade: Second Edition. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp.
421–436.
[32] L. Prechelt, “Automatic early stopping using cross validation: quantifying the
criteria,” Neural Networks, vol. 11, no. 4, pp. 761 – 767, 1998.
[36] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large
Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP
Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50,
http://is.muni.cz/publication/884893/en.
[37] Meta, “Data dumps — meta, discussion about wikimedia projects,” 2017, [Online;
accessed 18-December-2017]. [Online]. Available:
https://meta.wikimedia.org/w/index.php?title=Data_dumps&oldid=17422082
[38] S. R. Eide, N. Tahmasebi, and L. Borin, “The swedish culturomics gigaword corpus:
A one billion word swedish reference dataset for nlp,” pp. 8–12, 2016.
[39] O. Levy, Y. Goldberg, and I. Dagan, “Improving distributional similarity with lessons
learned from word embeddings,” Transactions of the Association for Computational
Linguistics, vol. 3, pp. 211–225, 2015.
[40] D. L. Olson and D. Delen, Advanced data mining techniques. Springer Science &
Business Media, 2008.
[43] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal, “On orthogonality and learning
recurrent networks with long term dependencies,” arXiv preprint arXiv:1702.00071,
2017.
[44] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.
[45] I. Loshchilov and F. Hutter, “Sgdr: stochastic gradient descent with restarts,” arXiv
preprint arXiv:1608.03983, 2016.