USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
Research and development entity, Data Science team, Palo IT, Paris, France
ABSTRACT
Nowadays, real-time systems and intelligent systems offer more and more control interfaces based on voice
recognition or human language recognition. Robots and drones will soon be controlled mainly by voice.
Other robots will integrate bots to interact with their users, which can be useful both in industry and in
entertainment. At first, researchers investigated "ontology reasoning". Given all the technical constraints
brought by the treatment of ontologies, an interesting solution has emerged in recent years: the construction
of a model based on machine learning to connect human language to a knowledge base (based for example
on RDF). We present in this paper our contribution to building a bot that could be used on real-time systems
and drones/robots, using recent machine learning technologies.
KEYWORDS
Real-time systems, Intelligent systems, Machine learning, Bot
1. INTRODUCTION
We present here our contributions within Palo IT [17] over a year of research & development
activities. The main part of the R&D entity works on Data Science trends and intelligent systems
based on machine learning technologies. The aim of this project was to create a semi-intelligent
bot. This bot must be able to analyse facts, reason and answer questions using machine learning
methods. This paper provides an overview of our work during this project. It consists of
four parts. In the first part, we present the context of the project, the issues, some related
works and the objectives. In the second part, we present details of the tasks that we dealt with during
our research project. Tasks about the implementation and testing of different methods of text
mining - applied to text data in French - and the test results are detailed in the
third part of this paper. Finally, we conclude with our analysis of what we have acquired through
this project and the future scope.
2. GLOBAL OVERVIEW
We present here the context, some related work, issues, and our objectives.
2.1. Context
The amount of text data on the web or stored by companies is growing continuously. In order to
exploit this wealth, it is essential to extract knowledge from such data. The discipline dealing
with this type of data, called "Text Mining", covers several tasks such as search indexing of
documents, summary generation, creation of bots, etc. The work done during our project is part of
the enrichment of Palo IT's textual and data analysis research. It aims to create a semi-
intelligent bot. For this purpose an internal R&D project was launched. This project, "PetiText"
(translated as SmallText in English; petit = small), is based on the analysis of, and reasoning on, short
sentences to detect new facts and answer questions. It involves an analysis of data from text
corpora which makes it possible to:
• extract targeted, sorted and value-added information for companies, using algorithms
• search for similarities and identify causal relationships between different facts
2.2. Issues
Faced with the growing demand of Palo IT customers to extract knowledge from their textual
data, the PetiText R&D project was launched. Indeed, these customers possess documents and
tools for collecting customer and employee reviews and complaints. Hence the need to design and
implement a tool for analysing this type of data. Text data, poorly exploited by most companies,
represent a wealth of information. Their analysis is a means of decision support and a strategic
asset for companies. A study of existing Text Mining products shows a major flaw in the processing of
text data written in French: the almost total absence of "open source"
libraries incorporating the semantics of the French language. Indeed, unlike for English, we found
that most libraries and tools used globally (such as CLiPS [1], NLTK [2], etc.) to treat this type of
problem are not reliable when it comes to French documents. For these reasons it was
decided to set up a new tool combining several text analysis methods that handles the French
language, allowing the machine to reason like a two-year-old child.
2.3. Related Work
Some authors have proposed to deal with those issues through deep learning and ontology reasoning. This
is the case of Patrick Hohenecker and Thomas Lukasiewicz [14], from the Department of
Computer Science, University of Oxford, who introduce a new model for statistical relational learning
built upon deep recursive neural networks, and give experimental evidence that it can
easily compete with, or even outperform, existing logic-based reasoners on the task of ontology
reasoning. Other authors have recently proposed in [15] a model that builds an abstract
knowledge graph of the entities and relations present in a document, which can then be used to
answer questions about the document. It is trained end-to-end: the only supervision given to the model is in
the form of correct answers to the questions. Thuy Vu and D. Stott Parker [16] describe a
technique for adding contextual distinctions to word embeddings by extending the usual
embedding process into two phases. The first phase resembles existing methods, but also
constructs K classifications of concepts. The second phase uses these classifications to
develop refined K embeddings for words, namely word K-embeddings. We propose to
complete these propositions with an approach that connects human language and knowledge bases
(here we start with French, but the same approach should apply to other languages).
2.4. Objectives
The aim of our project was to help in all the steps of creating a semi-intelligent bot. This bot will
learn facts from existing textual resources by conducting a thorough analysis. It must then be able
to deduce new facts and answer open questions. To achieve this, a combination of different
methods of textual data analysis was used. These methods can be grouped along three axes:
• Frequency analysis: using metrics based on the detection of global information and
characteristics of a text (keywords, rare words, etc.).
• Semantic analysis: based on the analysis of the context and emotions to contextualize a
given text.
• Logical analysis: applying formal logic rules to the extracted facts in order to derive new facts
and answer questions.
The "petiText" is a project that is part of Data Science Palo IT activities leaded by three PhDs: a
data science expert as supervisor Mr. Patrick LAFFITTE, then Mrs. Raja HADDAD and Mr.
Yassin CHABEB. Thanks to the wealth of existing libraries in python dedicated to machine
learning, the choice of that language was obvious. Regarding data storage, we used ZODB (Zope
Object DataBase) which is a hierarchical and object-oriented database. It can store the data as
python objects. We used Gensim [3] and Scikit-learn [4] are two python libraries that implement
various machine learning methods and facilitating the statistical treatment of data. These methods
of learning and statistical data computations require considerable material resources: due to the
volume of data to process and especially the time computations, therefore, two remote OVH
machines were rented. These machines have the following configurations: Machine n°1: 8 CPUs,
16Gb of RAM and a GPU; Machine n°2: 16 CPUs and 128Gb of RAM.
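As an illustration of the ZODB storage choice mentioned above, the following is a minimal sketch of persisting scraped definitions as Python objects; the file name and the stored data are hypothetical, not the project's actual schema.

# Minimal ZODB sketch: store word definitions as plain Python objects.
import ZODB, ZODB.FileStorage
import transaction
from persistent.mapping import PersistentMapping

storage = ZODB.FileStorage.FileStorage('petitext.fs')   # hypothetical file name
db = ZODB.DB(storage)
connection = db.open()
root = connection.root()

# Map each word to its list of definitions (one entry per context).
if 'definitions' not in root:
    root['definitions'] = PersistentMapping()
root['definitions']['poire'] = ['fruit du poirier', 'personne naive']

transaction.commit()
db.close()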
3.1. Formal logic model
The objective of this part is to create and implement a formal logic model. This was achieved by
combining the results of two tools. The first is used to apply a logical model to a set of sentences
modelled as relationships between objects. These relationships were extracted using the second
tool, CoreNLP. Appendix A shows an example of application of our logic model on a set of
sentences about the family domain.
3.1.1. Building a logic model
This model is based on interpreting the world as a set of facts, where every fact is a relationship
between two or more objects. Since the objects of a sentence and their link form a fact (a relationship), it
is sufficient to apply the logical rules that we have defined to derive and generate new information
between different objects.
A simple example: from the facts "a person is a man or a woman", "a man is male" and "a woman is
female", the model can conclude that "a person is male or female".
To extract new information from a given set of facts, we have implemented a set of logical rules.
A rule can generate either a new fact (called a conclusion) or a hypothesis. Each hypothesis can
become a conclusion if new facts arrive and validate it. The logical rules that we have defined in
this part are the following (a Python sketch of the second conclusion rule is given after the list):
• Conclusions:
- If obj1 is obj2 OR obj3, obj2 is obj2-1 OR obj2-2, and obj3 is obj3-1 OR obj3-2, then obj1 is
obj2-1 OR obj2-2 OR obj3-1 OR obj3-2.
- If obj1 is obj2 OR obj3, obj2 is obj4, and obj3 is obj5, then obj1 is obj4 OR obj5.
- If obj1 is obj2 and obj3 is obj2, then obj1 AND obj3 are obj2.
• Hypotheses:
- If obj1 is obj2 OR obj3, and obj3 is obj2, then obj3 is probably obj1.
- If obj1 is obj2 OR obj3, and obj4 is obj5 AND obj2, then obj4 is probably obj1.
- If obj1 is obj2 OR obj3, and obj2 is probably obj4, then obj1 is probably obj4.
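The sketch below is our own illustration (not the project's implementation) of how the second conclusion rule can be applied by forward chaining over a small set of "is" facts; the fact representation and the function name are assumptions made for this example.

# Facts map a subject to the tuple of things it "is"
# (one element = plain fact, two elements = a disjunction "obj2 OR obj3").
facts = {
    "person": ("man", "woman"),
    "man":    ("male",),
    "woman":  ("female",),
}

def apply_disjunction_rule(facts):
    """If obj1 is obj2 OR obj3, obj2 is obj4 and obj3 is obj5,
    then conclude obj1 is obj4 OR obj5."""
    conclusions = {}
    for subj, alts in facts.items():
        if len(alts) == 2 and all(alt in facts and len(facts[alt]) == 1 for alt in alts):
            conclusions[subj] = tuple(facts[alt][0] for alt in alts)
    return conclusions

print(apply_disjunction_rule(facts))
# {'person': ('male', 'female')}  ->  "a person is male or female"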
3.1.2. CoreNLP
The construction of logic models from the input sentences first requires a level of understanding obtained
from a syntactic and grammatical analysis. For this we used a natural language processing (NLP) tool,
CoreNLP [5], from Stanford University. CoreNLP fetches the lemmas of words, identifying
their basic form from plurals, conjugations or grammatical declensions. It also explains the
overall structure of the sentence, analysing subject, verb, etc., and acts as a parser adapting to
the turn of phrases. Figure 1 shows an example of analysis of a sentence using CoreNLP: the
edges represent the overall structure of the sentence, the blue tags represent the lemmas of the
different words, and those in red give the grammatical class of each word.
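A minimal sketch of querying CoreNLP from Python over its HTTP API follows. It assumes a CoreNLP server with the French models is already running locally on port 9000, and that the listed annotators are available for French; the sentence is only an example.

import json
import requests

CORENLP_URL = "http://localhost:9000"   # assumed local CoreNLP server (French models)
props = {
    "annotators": "tokenize,ssplit,pos,lemma,depparse",
    "outputFormat": "json",
}

sentence = "Une personne est un homme ou une femme."
resp = requests.post(CORENLP_URL,
                     params={"properties": json.dumps(props)},
                     data=sentence.encode("utf-8"))
doc = resp.json()

# Lemmas and part-of-speech tags of each token.
for token in doc["sentences"][0]["tokens"]:
    print(token["word"], token["lemma"], token["pos"])

# Dependency structure (edges of the parse).
for dep in doc["sentences"][0]["basicDependencies"]:
    print(dep["dep"], dep["governorGloss"], "->", dep["dependentGloss"])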
3.2. Web scraping
In this project we used this technique (web scraping) to store a maximum of French-language
definitions, in order to integrate the intelligent aspect into our bot. So that it can answer questions,
the meaning of words and phrases is very important, hence the choice of dictionary and synonym
websites. To scrape definitions from the various dictionary sites we used XPath. This library allows
Python to extract information (elements, attributes, comments, etc.) from a document through the
formulation of path expressions. These definitions and synonyms served as a learning base to allow
the bot to learn the meaning of words and phrases in different contexts.
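The following is a hypothetical sketch of that scraping step using the lxml library; the URL and the XPath expression are placeholders, not the actual dictionary sites used in the project.

import requests
from lxml import html

url = "https://example-dictionary.org/definition/poire"   # placeholder URL
page = html.fromstring(requests.get(url).content)

# Extract the text of every definition entry (placeholder XPath expression).
definitions = [d.text_content().strip()
               for d in page.xpath("//ol[@class='definitions']/li")]
print(definitions)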
3.3. Learning models
During this research project we have implemented and tested several text-data learning models, such
as TF-IDF and word embeddings. The application of these two models allowed
us to have a clear idea about the use cases for each model.
3.3.1. TF-IDF
Extracting relevant information from textual sources relies on statistical models. These are
used to detect rare words (and therefore the most significant ones) and to eliminate less significant ones
such as stop-words, which do not depend on the context. The most commonly used technique for this is
TF-IDF [8] (Term Frequency - Inverse Document Frequency).
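As an illustration, a minimal TF-IDF weighting sketch with scikit-learn's TfidfVectorizer on a toy French corpus might look as follows; the corpus and parameters are ours, for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "le pere est un homme qui a des enfants",
    "la mere est une femme qui a des enfants",
    "une poire est un fruit",
]
vectorizer = TfidfVectorizer()           # stop-words could also be removed here
weights = vectorizer.fit_transform(corpus).toarray()   # documents x terms matrix

# Rare terms (e.g. "poire", "fruit") get higher weights than frequent ones.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, round(weights[2, idx], 3))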
3.3.2. Word embeddings
To build a learning model based on word embeddings, every sentence is converted into a vector
of real values. A model based on the succession of several layers of neural
networks is applied to detect semantics, contexts and the relationships between them, and to classify
new texts through unsupervised learning. To apply word embeddings to a textual corpus
we used different Python libraries such as Gensim (word embeddings) and Scikit-learn. They offer
all the necessary methods (Doc2Vec, Word2Vec, ...).
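A minimal training sketch with Gensim follows (recent 4.x parameter names; the toy corpus and hyper-parameters are only illustrative, not those used in the project).

from gensim.models import Word2Vec

sentences = [
    ["le", "pere", "est", "un", "homme"],
    ["la", "mere", "est", "une", "femme"],
    ["une", "personne", "est", "un", "homme", "ou", "une", "femme"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

vector = model.wv["homme"]                       # 100-dimensional vector for "homme"
print(model.wv.most_similar("homme", topn=3))    # nearest words by cosine similarity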
We assessed the reliability of the two learning models cited earlier. This evaluation was
applied on a dataset (20 Newsgroups) of 20,000 items in 20 categories (1,000 items per category).
For this test, TF-IDF reaches a score of 58% while the word embeddings give a score of 98%.
This result supported our choice of word embeddings as the learning model for our bot.
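The exact evaluation protocol is not detailed here, so the following is only an illustrative sketch of a classification benchmark on 20 Newsgroups with TF-IDF features and a generic classifier, not the project's actual test code.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vec = TfidfVectorizer(max_features=50000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(train.data), train.target)

pred = clf.predict(vec.transform(test.data))
print("accuracy:", accuracy_score(test.target, pred))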
4. INTELLIGENT BOT
To develop an intelligent (or semi-intelligent) bot that can analyse facts, learn and answer
questions, we combined the different parts detailed in the previous section. Our bot thus combines
quality text processing, classical logic and machine learning (semantics). It mainly combines three
components: logic, semantics and the training/learning base. The bot must be
logical, intelligent, autonomous and must render services. In our case the services are to answer
questions and to generate different contexts, conclusions and hypotheses from the given facts in order to
enrich its knowledge base. Aside from the natural language processing that must be done
automatically on the arrival of new facts, several challenges need to be resolved to
reach reliable conclusions.
The bot must be logical in its computations and answers, like a human, which is why a conventional
logic model was developed. This model makes it possible to validate or not the available facts,
based on the rules of basic classical logic, to reason about the facts, and to generate
conclusions to be potentially added to the knowledge base of the bot. Before moving to the
logical model, the facts, which are simply phrases, must first be processed and decomposed by a
natural language processing tool. Regarding complexity and execution time, the logic model benefits
from the simplicity of classical logic rules and is very fast, because the facts are relatively simple
sentences composed of a single verb, subject, complement and conjunction.
The knowledge base is made up of all the conclusions and facts that are valid. It is
enriched as new facts arrive and conclusions are generated. The bot is then autonomous and no
longer depends on human intervention, as textual data sources feed the bot with new facts. Figure
2 summarizes the fact-validation process. Still, when the bot is launched for the first time, it must
already have a basic knowledge base on which to learn, so that it can detect and recognize the
context and thus be able to work on the facts that arrive. The bot should behave like a child
that reasons and learns, so the most logical starting point is a knowledge base built on dictionary
definitions (for each word, several definitions depending on the context) and synonyms. These are
collected with the scraping technique from several specialized websites. The initial knowledge base
contains 38,565 definitions covering 226,303 words.
Now that we have a logic model and an initial knowledge base, the challenge is to find a
technique that allows the bot to detect the context of newly arriving facts and to rank them;
this is exactly where word embeddings are used. A recent Deep Learning idea is that the
approximate meaning of a word or sentence can be represented as a vector in a multidimensional
space: the nearer the vectors, the more similar the meanings. To do so, we used Gensim, a Python
library designed to automatically extract semantic topics from documents as efficiently as possible.
Gensim is designed to process raw, unstructured digital text. The algorithms in Gensim, such as
Latent Semantic Analysis, Latent Dirichlet Allocation and random projections, describe the semantic
structure of documents, which is extracted by examining the statistical patterns of word
co-occurrence in a corpus of training documents. These algorithms are unsupervised,
which means that no human input is required: the only input is the text document corpus used to
train the model. Gensim allows all of these models to be trained and queried directly on such a corpus.
Word embeddings are one of the most exciting areas of research in the field of Deep Learning. A
word embedding WE: words → ℝⁿ is a parameterized function that maps the words of a given
language to high-dimensional vectors (typically of size 100, 200 or 500). Essentially, each word is
represented by a numerical vector. For example, we might find (poire means pear in French):
WE("poire") = (0.2, -0.4, 0.7, ...)
The purpose and usefulness of word embeddings consist in grouping the vectors of similar
words in a vector space; mathematically, similarities are detected between the different vectors. These
are numerical representations describing the characteristics of the word, such as its context. Word
embeddings have several variations, including:
• Word2vec
• Doc2vec
Word2vec is a two-layer neural network that processes text. The input to the network is a text
corpus and its output is a set of vectors, which capture semantic features of the words in the
corpus. Word2vec is not a deep neural network in itself, but it is very useful because it turns text
into a numerical form (vectors) that deep networks can understand. Figure 3 summarizes the
process used in the word2vec algorithm. Words can be considered as discrete states; we then simply
look for the transition probabilities between these states, such as the probability that they occur
together. In this case we obtain close vectors for words appearing in a similar context (the closer the
cosine is to 1, the more similar the context of these words is).
In Mikolov's [12] introduction to learning word2vec, each word is mapped to a single
vector, represented by a column in a matrix. The column is indexed by the position of the word in
the vocabulary. The concatenation, or the sum, of the vectors is then used as a feature for
predicting the next word in a sentence. Figure 4 gives an example of word2vec
concatenation.
If we take into consideration enough data, usages and contexts, word2vec can make very accurate
guesses about the meaning of a word based on its past appearances. These guesses can be
used to establish associations of words with other words in terms of vectors. For example:
W('woman') - W('man') ≈ W('queen') - W('king')
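As an illustration, this analogy can be queried with Gensim; the sketch below assumes the optional gensim-data download of a small pretrained English embedding ("glove-wiki-gigaword-50"), but any trained Word2Vec model would be queried the same way.

import gensim.downloader as api

# Downloads a small pretrained embedding on first use (assumption: network access).
wv = api.load("glove-wiki-gigaword-50")

print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected to return something close to ('queen', ...)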
Word2vec characterizes words by the other words it detects around them in the input corpus. Word2vec
includes two methods (a configuration sketch in Gensim follows the list):
1. CBOW (continuous bag of words): the surrounding context is used to predict the target word.
2. Skip-gram with negative sampling (or skip-gram):
• This method can also work well with a small amount of training data. It can also
represent rare words or phrases well.
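The sketch below shows how the two methods are selected in Gensim (the sg flag chooses skip-gram and negative controls negative sampling); the corpus and hyper-parameters are placeholders.

from gensim.models import Word2Vec

corpus = [["une", "personne", "est", "un", "homme", "ou", "une", "femme"]]

cbow = Word2Vec(corpus, sg=0, vector_size=100, min_count=1)                # CBOW
skipgram = Word2Vec(corpus, sg=1, negative=5, vector_size=100, min_count=1)  # skip-gram + negative sampling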
Doc2vec performs learning on a large set of documents and creates a vector-space model. In this
model, each document is represented by a vector in a space built from the vectors of its words. Thus,
to obtain degrees of similarity, the most_similar method uses the cosine between vectors: the closer
the cosine is to 1, the higher the similarity. Figure 6 illustrates the steps of the doc2vec algorithm.
To apply Doc2Vec, two methods can be used.
The first, the distributed memory model (DM), considers the paragraph vector together with the
vectors of the paragraph's words (word2vec) to predict the next word in a text. Using this distributed
memory model comprises:
• predicting the next word using the context of the word plus the paragraph vector;
• sliding the context window over the document while the paragraph vector stays fixed (hence
"distributed memory").
The second method, the distributed bag of words (DBOW), ignores the context words in the input:
a single paragraph vector predicts the words in a small window. This method requires less storage and
is very similar to the Skip-gram method of word2vec [12]. It is less efficient than DM; however, the
combination of the two methods, DM + DBOW, is the best way to build a Doc2vec model.
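A rough sketch of this combination follows (our own, with assumed parameters and a toy corpus): a DM model and a DBOW model are trained on the same tagged documents and their inferred vectors are concatenated.

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["le", "pere", "est", "un", "homme"], tags=["famille_1"]),
        TaggedDocument(words=["une", "poire", "est", "un", "fruit"], tags=["fruit_1"])]

dm = Doc2Vec(docs, dm=1, vector_size=100, min_count=1, epochs=50)    # distributed memory
dbow = Doc2Vec(docs, dm=0, vector_size=100, min_count=1, epochs=50)  # distributed bag of words

def combined_vector(words):
    """Infer a DM+DBOW representation for a new sentence."""
    return np.concatenate([dm.infer_vector(words), dbow.infer_vector(words)])

print(combined_vector(["la", "mere", "est", "une", "femme"]).shape)  # (200,)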
4.4.1. Learning time and scores
With a machine having 16 CPUs and 128 GB of RAM, the learning time of the model on the 38,565
definitions of 226,303 words varies proportionally to the model parameters, the size of the
generated vectors and the number of learning iterations.
Table 1 shows the results of the different tests we performed. We note that the learning time and the
quality of the model grow with the number of iterations of the algorithm. We also find that the best
results are obtained with 200 iterations and vectors of size 200. Figure 8 illustrates the evolution
of the score according to the number of iterations.
Figure 8. Scores obtained according to the size of the vectors and the number of iterations
The vector inferred by the model with vector size 200 corresponds to:
[-2.53752857e-01 -2.71043032e-02 4.33574356e-02 -9.83970612e-02 2.55723894e-01 -7.85913542e-02 -5.09732738e...]
Figure 9 shows the tags that the different models have found by calculating the cosine between
the inferred vector of the previous sentence and the vectors of all the definitions in the vector-space
model.
Figure 9. Search results on the context of the preceding sentence regarding the word family
For this example, we notice that the models with vector size 200 and 100 or 200 iterations
perform best, since they are the only ones that returned the tag
"family", which corresponds to the definition of the input sentence.
Figure 10 is a screenshot of our first bot prototype. It is a French PetiText reasoner working on three
universes/contexts: family, abstract objects and biological organisms.
We then choose the family definition context and list the facts that we give to the bot:
Fact 1: a person is a man or a woman
Fact 2: a woman is female
…
In Figure 11, the bot starts reasoning on the facts, in order, and generates conclusions and
hypotheses:
Generated fact 1: parents and kids are parts of the same family
Generated fact 2: a father is a male
…
The same process was tested in real-time interaction and it works; we can also add new facts at
the bottom of the screen.
5. CONCLUSIONS
We have presented here the different steps and tools combined in order to build a semi-intelligent
bot based on machine learning technologies. With a solid foundation of learning, a logic model, a
word-embeddings model with good scores (≈ 84%) and current and future improvements, we hope to
combine the three parts in order to have a fully functional bot. We use advanced tools and
technologies that are very recent and widespread in Data Science: the Python programming
language, Jupyter notebooks for a complete development environment, Gensim for word
embeddings, and advanced natural language processing tools such as CLiPS and Stanford
CoreNLP. During the project, we have used in particular classification and clustering methods,
Data Mining and Text Mining, Sentiment Analysis, and evaluations of the quality of classifiers
(recall, precision, F-measure). Then come the current systems, languages and paradigms for
Big Data and advanced Big Data analytics, which showed us the world of Big Data,
many use cases and market opportunities. We have tested some big data architectures, including
Hadoop and Spark, with the Python, Java and Scala languages.
As the subject of the project is part of R&D activities, the ultimate goal was clear and
understandable, but in practice, problems arose in understanding how to achieve the objectives.
Indeed, understanding how to get the value and meaning of a text is not easy. Then follows
the difficulty of understanding how the chosen methods work, which algorithms to use and
when to use them. Development work and tests, reading papers and publications of other researchers,
and the documentation: all of that was done in order to understand the subject, to progress and to obtain
good results (≈ 84%), which was not the case at the beginning (≈ 60%).
After months of data processing (scraping, natural language processing, data cleaning, data
standardization, ...) and the development of the logic and learning models (word embeddings), the
first satisfactory results and scores were obtained. We now plan improvements and the use of
new techniques in the coming months. We will use LSTM (Long Short-Term Memory), an
architecture of recurrent neural networks (RNN), which should further improve the quality of
prediction and classification. We also plan to finalise the integration of the bot within a Parrot
drone that we control by voice, thanks to a previous research project, in order to complete a
global interactive real-time interface between humans and drones/robots [18].
ACKNOWLEDGEMENTS
We would like to thank everyone who helped us during this year.
REFERENCES
[1] CLiPS, https://www.clips.uantwerpen.be/PAGES/PATTERN-FR.
[9] Quoc V. Le & Tomas Mikolov (2014) Distributed Representations of Sentences and Documents.
CoRR abs/1405.4053
[12] Quoc V. Le & Tomas Mikolov (2014) Distributed Representations of Sentences and Documents.
CoRR abs/1405.4053
[14] Patrick Hohenecker & Thomas Lukasiewicz (2017) "Deep Learning for Ontology Reasoning", CoRR
abs/1705.10342.
[15] Trapit Bansal, Arvind Neelakantan, Andrew McCallum, (2017) “RelNet: End-to-end Modeling of
Entities & Relations”, University of Massachusetts Amherst, CoRR abs/1706.07179.
[16] Thuy Vu & Douglas Stott Parker, (2016) “$K$-Embeddings: Learning Conceptual Embeddings for
Words using Context”, HLT-NAACL, pp 1262-1267
AUTHORS