Topic Extraction For Ontology Learning

Marian-Andrei Rizoiu and Julien Velcin
Laboratoire ERIC
5 av. P. Mendès-France
France
e-mail:{Marian-Andrei.Rizoiu; Julien.Velcin}@univ-lyon2.fr
Abstract
This chapter addresses an issue crucial to ontology learning: the extraction of topics from
text corpora. The first part is dedicated to an overview of some solutions present today in
the literature. These solutions deal mainly with the inferior layers of the Ontology
Learning Layer Cake. They are related to the challenges of the Terms and Synonyms
layers. The second part shows how the same pieces can be bound together into an
integrated system for extracting meaningful topics. While the extracted topics are not
full concepts yet, they constitute a convincing step towards concept building and
therefore towards ontology learning. The chapter ends with some perspectives that emerge from this work.
The last years have seen intensive research on automatic ontology construction,
especially from natural language texts. Special attention has been given to texts found on
the Web, as they have certain specific features that we will present later in this chapter.
Buitelaar, Cimiano and Magnini (2005) divide the process of ontology learning into a chain of different phases,
the output of each phase being the input of the following one, as described in the chapter
(place here your reference to the chapter presenting the Ontology Learning Layer Cake) of
this book. An analysis of the state-of-the-art in terms of ontology learning at each of the
layers being beyond the scope of this chapter, we propose to take the reader on a descending overview of the inferior layers of the
Ontology Learning Layer Cake (Buitelaar et al. (2005)), highlighting the challenges at
each step. Starting from the observation that ontologies are dynamic, that they keep
evolving mainly by means of refining concepts or replacing old concepts with new ones,
special attention must be paid to the “concept” layer. Therefore, automated ontology
learning is closely connected to concept learning. As shown in Cimiano et al. (2006), the
main approach toward learning concepts and their taxonomy - the hierarchical relations
between them - is conceptual clustering. This approach generally outputs a concept tree, each level being more specific
than the previous one. At each level, the collection of terms is partitioned around each concept,
the partitions having different sizes at the different levels: big under the root and smaller as we reach the leaves. Examples of algorithms
created towards this purpose are the well-known COBWEB (Fisher (1987)) and the more
recent WebDCC (Godoy and Amandi (2006)). While this approach is promising and has
shown good results, the resulting hierarchy is still very noisy and dependent on both the
quality of extracted terms and their frequency in the text collection. Therefore,
researchers have tried to improve the quality by allowing the expert to validate and guide
the process. Others have touched the field of semi-supervised learning techniques. Given the dependence of
the superior layers of the cake on the quality of terms, we descend another step into the
ontology layer cake. At the terms and synonyms layers, new challenges arise, like
extracting pertinent, non-ambiguous terms and disambiguating them. So, the purpose of the
lower layers of the cake is to extract terms and regroup synonyms under the same concept.
There are other approaches that pass through topics on the way towards concepts.
Just like the latter (see the concept definition in Buitelaar et al. (2005)), the definition of a topic is
controversial. While some researchers consider a topic to be just a cluster of documents
that share a thematic, others consider topics as an abstraction of the regrouped texts, one that captures the
idea emerging from the texts. Table 1 presents an example of the topics that can be
extracted from text. More details about some experimentation made with this system will
be presented later, in section “Combining the two phases into an integrated system for
extracting topics”. So, a topic is not yet a concept as it is an abstraction of the idea
behind a group of texts rather than a notion in itself. While the difference between the
two is subtle, evolving a topic into a fully fledged concept is still to be achieved and the
reader will find a couple of ideas in the section “Conclusions and Perspectives”.
Clustering algorithms divide the input set of texts into groups that are similar in terms of their
thematic (politics, economics, informatics, etc.), meaning that all the texts in a group
approach the same domain and there is a visible distinction between them and the texts
from the other groups. We chose to present these approaches not only from the point of
view of topic extraction, but also regarding their usage at the different layers of the Ontology Learning Layer Cake.
Most of these clustering algorithms present at the output a center for each of the
created groups. This center is often called a centroid and summarizes the common part of
all the documents in the cluster. In fact, the centroid can be viewed as an abstract
representation of the topic denoted by that group, a prototype. But in order to become a
topic, meaningful, human-readable phrases or words need to be assigned to label the
centroid. What makes a good topic name? In Roche (2004) the problem is presented in
detail. One of the first things that must be taken into consideration is that words have the
propriety of polysemy, meaning that the same word can have different meaning in
different contexts. For example each of the words “data” and “mining” have different
words than the phrase “data mining”. Seen from the light of this observations, we would
than one group. In this way, a text that talks about de “economical outcomes of a political
decision” can be part of both the “politics” groups, as well as the “economics” group.
This second phase gives the topic a linguistic materialisation. It makes it possible to go from
an abstract centroid, a prototype that summarises the common part of all documents in
the cluster, to a human-readable topic name.
Textual Clustering
Given the property of polysemy of terms, an important aspect of the synonyms layer
is the identification of the appropriate sense of terms, which determines the set of
synonyms that have to be extracted. Buitelaar et al. (2005) present the two main groups of algorithms that can be used at this layer.
The same authors state that “the second group of algorithms, which are based on
statistical measures used mainly in Information Retrieval, start from the hypothesis that
terms are similar in meaning to the extent in which they share syntactic contexts (Harris
(1968))”. Therefore, performing textual regrouping on the entire collection of texts would
place texts that share the same content into the same group. This would lead to terms with similar meanings - and thus potential synonyms - being associated with the same groups of texts.
In the following sections, we have divided the textual clustering algorithms into
categories based on whether or not they are able to create overlapping groups. If terms can
have different meanings depending on the context - polysemy - it is only natural to allow
them to be part of more than one group. In this way, the clustering algorithm would not
only find synonyms, but its output could also be used for disambiguation. It is worth
mentioning that most of today’s word sense disambiguation algorithms, like the one in
While some of the solutions presented below were created specifically for text
mining - like LDA -, others were designed for a general purpose clustering. They partition
individuals into groups based on the similarity of their features. But all of these methods
can be used for textual clustering by translating the documents into the Vector Space Model (see subsection “Vector Space Model”).
Crisp solutions. We present these two algorithms principally for didactic
reasons. KMeans (Macqueen (1967)) is one of the most well-known clustering algorithms.
Extensive work has been done and numerous papers have proved its accuracy for various
applications. It divides the documents into disjoint groups that cover the dataset (named “crisp” clustering). To do this, the algorithm iteratively optimizes an
objective criterion, typically the squared-error function. In the case of text mining and
information retrieval, the cosine distance can be used in order to calculate similarities
between documents. Bisecting KMeans (BKM) (Steinbach, Karypis, and Kumar (2000)) is a hierarchical variant of KMeans which has been shown to be more accurate than KMeans for
the task of text clustering. It is based on a top-down algorithm that divides, at each step,
the documents into two crisp sub-clusters. For instance, at the first level, the whole corpus
is divided into two subsets according to multiple restarting 2-means. For the next level,
one of the subsets is chosen (for example, the bigger one) and split: globally, we obtain
three text clusters. The process is iterated until a stopping criterion is satisfied, e.g. a
fixed number K of clusters. The final output of BKM can be seen as a truncated
dendrogram.
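To give the reader a concrete picture of the procedure, the following sketch implements Bisecting KMeans on a toy corpus in Python; the corpus, the use of scikit-learn for the 2-means splits and the stopping at a fixed number of clusters are illustrative choices, not part of the original BKM systems.

```python
# Minimal Bisecting KMeans sketch (illustrative, not the original BKM implementation).
# Assumes scikit-learn is available; the tiny corpus below is a made-up example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["oil and gas company reports profits",
        "cocoa buffer stock agreement",
        "gas company buys oil properties",
        "tonnes of copper exported",
        "copper mine increases production",
        "cocoa prices and buffer stock rules"]

X = TfidfVectorizer().fit_transform(docs)          # documents in the Vector Space Model
clusters = [list(range(len(docs)))]                # start with one cluster: the whole corpus
K = 3                                              # desired number of leaf clusters

while len(clusters) < K:
    biggest = max(clusters, key=len)               # pick the largest cluster ...
    clusters.remove(biggest)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[biggest])
    for side in (0, 1):                            # ... and split it into two crisp sub-clusters
        clusters.append([d for d, l in zip(biggest, labels) if l == side])

print(clusters)                                    # a crisp, non-overlapping partition
```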
Of course, there are many other systems offering clustering solutions, some of them
even having topic extraction capabilities, like AGAPE (Velcin and Ganascia (2007)). But
their main drawback is that they output a crisp partition, where each document can
be part of only one group. While, from the point of view of the Ontology Learning Cake,
they can be used for regrouping synonyms, they cannot be used for disambiguation. From
the topic extraction point of view, they do not allow overlapping for the clusters, forcing each document into a single topic.
Fuzzy solutions. Fuzzy clustering algorithms (such as the fuzzy relative of ISODATA proposed by Dunn (1973)) assign to each document a degree of
belonging to all clusters, rather than belonging completely to just one or several clusters.
Thus, a document on the edge of a cluster is associated with it to a lesser degree than a
document in the cluster center. For each document d, we have a coefficient u_k(d) giving the
degree (similar to a probability) of being in the kth cluster. Still, the output of fuzzy clustering can be transformed into an overlapping partition by setting a
threshold θ and considering that if u_k(d) > θ then the document d is in the kth cluster.
Fuzzy KMeans is the adaptation of the original KMeans algorithm to fuzzy logic. The main differences between the Fuzzy KMeans and the original version are:
• the way the objective function is calculated - every document contributes, to a certain degree, to all of the
clusters.
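The transformation of fuzzy memberships into an overlapping partition with the threshold θ can be illustrated with a small sketch; the membership values below are invented for the example.

```python
# Turning fuzzy memberships u_k(d) into overlapping clusters with a threshold theta.
# The membership matrix below is a made-up example (rows = documents, columns = clusters).
import numpy as np

memberships = np.array([[0.70, 0.25, 0.05],   # document 0: clearly in cluster 0
                        [0.45, 0.45, 0.10],   # document 1: on the edge of clusters 0 and 1
                        [0.10, 0.15, 0.75]])  # document 2: clearly in cluster 2
theta = 0.40

overlapping = {d: [k for k, u in enumerate(row) if u > theta]
               for d, row in enumerate(memberships)}
print(overlapping)   # {0: [0], 1: [0, 1], 2: [2]}
```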
Latent Semantic Indexing (LSI) (Berry et al. (1995)) approaches the problem on an algebraic ground. In LINGO (Osinski (2003)), LSI is used for the clustering purpose in conjunction
with the Suffix Array (Manber and Myers (1990)) frequent phrase extraction algorithm,
which will be detailed in section “Combining the two phases into an integrated system for
extracting topics”. The main idea of the algorithm is to decompose the term/document
matrix A using Singular Value Decomposition, A = U S V^T, with U and V orthogonal matrices
containing the left and right singular vectors of A, and S a diagonal matrix with the
singular values of A ordered decreasingly. If we keep only the k highest ranking singular
values and eliminate the rest, along with the corresponding columns in U and lines in V,
we obtain A_k, the best rank-k approximation of A.
In most clustering methods, the number of clusters k is one of the input
parameters, which is arbitrarily set by the expert. The LSI approach allows an automatic
approximation of the number of clusters, based on the values of the singular values of the
term/document matrix. In LINGO, the Frobenius norm of the A and A_k matrices is used to calculate the percentage
distance between the original term/document matrix and its approximation.
The columns kept in the truncated matrix form an orthogonal basis for the document space. According to vector space
theory, every component of the space, in our case every document, can be expressed as a
linear combination of the elements of this basis:

$d_i = \alpha_1 e_1 + \alpha_2 e_2 + \ldots + \alpha_k e_k$

The elements $e_l$, $l \in \{1..k\}$, of the basis can be considered as the centers of the classes, and
the coefficients $\alpha_l$ as degrees of association. Read this way, the formula above is highly similar to the fuzzy approach described earlier, the document
being expressed in terms of all the class centers.
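The following sketch illustrates this decomposition: it truncates the SVD of a term/document matrix and selects k from the Frobenius-norm ratio, in the spirit of LINGO; the random matrix and the quality threshold q are arbitrary choices for the example.

```python
# Sketch of the LSI-style decomposition used in LINGO: truncate the SVD of the
# term/document matrix and pick k from the Frobenius-norm ratio ||A_k|| / ||A||.
# The matrix A and the quality threshold q are illustrative assumptions.
import numpy as np

A = np.random.rand(50, 20)                 # term/document matrix (50 terms, 20 documents)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

q = 0.90                                   # required approximation quality
frob_A = np.sqrt((s ** 2).sum())           # ||A||_F equals the norm of the singular values
k = next(k for k in range(1, len(s) + 1)
         if np.sqrt((s[:k] ** 2).sum()) / frob_A >= q)

A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A
print(k, np.linalg.norm(A - A_k) / np.linalg.norm(A))
```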
Latent Dirichlet Allocation (D. M. Blei, Ng, Jordan, and Lafferty (2003)) is a
probabilistic generative model designed to extract topics from text corpora. It considers
documents as collections of words and models each word in a document as a sample from
a mixture model: each component of the mixture can be seen as a “topic”. Thus each
word is generated from a single topic, but different words in a document are generally
generated from different topics. The model is similar to probabilistic Latent Semantic Analysis,
except that in LDA the topic distribution is assumed to have a Dirichlet prior. This point
addresses some weaknesses of the earlier model, such as overfitting
and the impossibility of making inferences on new documents. More precisely, LDA is governed by two hyper-parameters:
α and β are the basis of two Dirichlet distributions. The first Dirichlet distribution is
dedicated to the generation of the topic mixture for each of the |D| documents. The
second Dirichlet distribution is dedicated to the generation of the word mixture for each of
the K topics. Each topic is then a distribution over the W words of the vocabulary. The
generative process is the following: for each word wd,i of the corpus, draw a topic z
depending on the mixture θ associated to the document d and then draw a word from the
topic z.
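This generative process can be illustrated directly; the sketch below samples a tiny synthetic corpus from the model (the vocabulary, K, α and β values are arbitrary illustrative choices).

```python
# Sampling documents from the LDA generative process described above.
# Vocabulary, K, alpha and beta are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["oil", "gas", "company", "cocoa", "stock", "copper"]
K, W, n_docs, doc_len = 2, len(vocab), 3, 8
alpha, beta = 0.5, 0.1

phi = rng.dirichlet([beta] * W, size=K)              # one word distribution per topic
for d in range(n_docs):
    theta = rng.dirichlet([alpha] * K)               # topic mixture of document d
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)                   # draw a topic for this word position
        words.append(vocab[rng.choice(W, p=phi[z])]) # then draw the word from that topic
    print(d, theta.round(2), " ".join(words))
```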
Note that words without special relevance, like articles and prepositions, will have
roughly even probability between classes (or can be placed into a separate category). As
each document is a mixture of different topics, in the clustering process the same
document can be placed into more than one group, thus resulting in a (kind of) overlapping clustering.
Estimating the parameters of the model, among which the hyper-parameters α and β, is rather difficult because the posterior p(θ, z | D, α, β, K)
cannot be computed exactly, and approximate inference techniques (such as variational methods or Gibbs sampling) have to be used.
Once trained, the model answers questions such as: which documents deal with which
topics and with which proportions? What part of the document is associated to which
topic? Depending on the likelihood p(d | Θ), does a new document describe an original thematic?
Still, this approach raises several issues:
• Each document is described as a mixture of possibly many topics. While this may be interesting for describing the
documents, in the case of clustering it could lead to a situation where each document
belongs, more or less, to many clusters (similar to a fuzzy approach). An issue is therefore
to be able to choose a finite (and hopefully short) list of topics to be associated to the document.
• This method does not present a center for each cluster, but a distribution of the
document over the topics. This could make it difficult to associate a readable name to the
cluster. Note that recent works relative to LDA are seeking to find useful names using phrases rather than single words.
• As in the other presented methods, this probabilistic approach does not solve the
classical problem of finding the global optimum and choosing the number K of topics. For
the latter, some methods are proposed inspired by the works in model selection
(Rodríguez (2005)).
Numerous works have followed the way designed by Blei et al. to deal with various related
issues: extracting topic trees (D. Blei, Griffiths, Jordan, and Tenenbaum (2004)),
modeling topics through time (D. Blei and Lafferty (2006)), finding n-grams instead of single words to describe the topics (X. Wang, McCallum, and Wei (2007)), etc.
Overlapping KMeans (OKM) (Cleuziou (2007)) is a recent extension of the well-known K-Means. It shares the general outline of the
original algorithm: choosing k centroids (centers) from the data set, and then iterating two steps, the assignment of documents to clusters and the re-computation of the centroids. The main difference is that a
document can be assigned to multiple clusters. If in K-Means each document was assigned
to the centroid that was closest to it, in terms of cosine distance (detailed in subsection
“Vector Space Model”), OKM calculates an image of the centroids to which the document is assigned, adding the document
to clusters so that its distance to this image is minimal. This image is the Gravity Center
of the centroids of all the clusters the document belongs to.
Therefore, the function that OKM tries to minimize is the distortion in the dataset:
$$\mathrm{distortion}(\Pi) = \frac{1}{NK} \sum_{i=1}^{N} d\left(X^{(i)}, Z^{(i)}\right)^2$$

where $X^{(i)}$ is the $i$-th document and $Z^{(i)}$ its image, i.e. the gravity center of the centroids of the clusters to which it is assigned.
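The assignment step that produces the images $Z^{(i)}$ can be sketched as follows; this is a simplified greedy version written for illustration, and the details (distance, ordering, tie handling) differ from the exact procedure of Cleuziou (2007).

```python
# Simplified sketch of the OKM assignment step: each document is assigned to the
# set of clusters whose gravity center (its "image") is closest to it, built greedily.
# Euclidean distance is used here for simplicity; the systems described in the text
# use the cosine distance, and details differ from Cleuziou (2007).
import numpy as np

def okm_assign(doc, centroids):
    order = np.argsort([np.linalg.norm(doc - c) for c in centroids])  # nearest centroid first
    assigned = [order[0]]
    best = np.linalg.norm(doc - centroids[order[0]])
    for j in order[1:]:
        image = np.mean([centroids[k] for k in assigned + [j]], axis=0)
        dist = np.linalg.norm(doc - image)
        if dist < best:                      # keep adding clusters while the image gets closer
            assigned.append(j)
            best = dist
        else:
            break
    return assigned                          # possibly more than one cluster

centroids = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
print(okm_assign(np.array([0.5, 0.5]), centroids))   # document assigned to clusters 0 and 1
```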
OKM inherits from K-Means most of its drawbacks (its strong dependence on
the initialization and on the number of clusters, which must be arbitrarily specified by the
expert) and its advantages (linear execution time, good performance when working with
texts). But it outputs directly an overlapping partition of the data set, without the need
of setting a threshold parameter necessary for fuzzy approaches as those presented before.
This is the main reason why it was chosen for the clustering task in the topic extraction
algorithm that will be presented in detail in section “Combining the two phases into an
integrated system for extracting topics”. This threshold, necessary for transforming fuzzy
memberships into actual cluster assignments, is no longer needed, as OKM directly allows a document to belong to more
than one cluster. This way, terms that have different meanings depending on the context can be regrouped together with each of their sets of synonyms.
A more recent, weighted extension of OKM (Cleuziou (2009)) adjusts the term
weights internally and achieves even better performances in terms of precision, recall
and FScore. At the same time it limits the overlapping in the clusters issued by OKM,
bringing it closer to a subspace clustering approach, a review of which can be found in Parsons, Haque, and Liu (2004).
Keyphrase Extraction
The first level of the Ontology Layer Cake is the Terms Layer. This layer is a
prerequisite for all aspects of ontology learning from text (Buitelaar et al. (2005)). Its
purpose is to extract the terminology that is relevant for the domain
(Cimiano et al. (2006)). Buitelaar et al. (2005) observe that although “the literature
provides many examples of term extraction methods that could be used as a first step in
ontology learning from text”, still “much of the research on this layer in ontology learning
has remained rather restricted”. Cimiano et al. (2006) also consider that automatic term
extraction techniques have not yet reached their maturity, since the resulting lists of terms typically still need to be filtered and validated.
Topic extraction, on the other hand, shares the same need for relevant, unambiguous
terms or phrases to synthesise the thematic of the group of documents associated to the
topic. The algorithm presented in the section “Combining the two phases into an
integrated system for extracting topics” has 3 phases for extracting topics: overlapping
clustering, keyphrase extraction and the association of names to clusters. Ideally, the
topic name is a complete phrase that contains all the words that have a special meaning
together (like “data mining”) and all the prepositions and articles that make sense to the
human reader. In this context, a keyphrase is a sequence of “one or more words that is considered highly relevant as a whole”, while a keyword is “a
single word that is highly relevant” (Hammouda, Matute, and Kamel (2005)).
The literature presents several ways of classifying the term extraction algorithms.
Hammouda et al. (2005) divide the approaches into two categories, based on the learning type:
• The approaches that treat keyphrase extraction as a supervised
learning task, often regarded as a more intelligent way of summarizing the text. These
approaches make use of the knowledge of the field expert, asking him to validate, at
each step for the incremental algorithms, the extracted phrases. While in this way the
results are less noisy - only interesting collocations will be extracted -, involving a human
supervisor can make the whole process slow, expensive and biased towards the specific
field (e.g. microbiology). These approaches face problems when asked to process large
datasets of general purpose texts. Examples include ESATEC (Biskri, Meunier, and Joyal (2004)), among others.
• The approaches that extract the keyphrases from a set of documents, which is an
unsupervised learning technique, trying to discover the topics rather than learn from
examples. Not depending on a human expert makes this kind of approaches scale well to
large datasets. Still, their major drawback is the almost exponential quantity of extracted
phrases, most of them having no real interest for the specific domain of the application,
leading to a noisy output. Still, there are techniques to improve their precision,
some of them presented later in this section. Examples: CorePhrase Hammouda et al.
(2005), Armil Geraci, Pellegrini, Maggini, and Sebastiani (2006), SuffixTree Extraction
Osinski (2003).
Other researchers (Roche (2004); Buitelaar et al. (2005); Cimiano et al. (2006))
classify the approaches according to their origin. A first family of approaches comes from
Natural Language Processing research. They employ linguistic processing like phrase analysis and part-of-speech tagging.
In Roche (2004), three linguistic systems are presented: TERMINO, LEXTER and
INTEX & FASTR. All these systems make use of morphological and syntactic
information about the words in the texts. The POS (Part-Of-Speech) tagger tries to
recognize whether the word is a noun, adjective, verb or adverb, and tries to characterize
it morphologically (number, person, mood, tense, etc.). Based on this information, the
lemmatisation process extracts the base form of the word (masculine singular for nouns, infinitive
for verbs). With the texts tagged, each system has its own approach toward
extracting the keyphrases: the sentences are analyzed, and then certain patterns are used to uncover the keyphrases (ex: <Head> <Expansion>);
other systems use this information to extract nominal groups from the text and then search for elements that disambiguate them.
Such systems produce a less noisy output, but they are also vulnerable to multilingual corpora and neologisms. They
also have the tendency of being adapted to stereotypical texts (texts from a specified
narrow field) (Biskri et al. (2004)). In other words they do not adapt or scale easily to
new fields or datasets. This makes them particularly difficult to work with when dealing
with term extraction from texts found on the internet - Web Mining. Documents on the
internet do not necessarily have a scientific writing style, nor do they always respect the
official spelling.
Also, the use of linguistic methods leads to an almost exponential explosion of the
number of collocations extracted when the size of the corpus increases. That is why the
usage of methods based only on linguistic information could prove prohibitive. However,
this can be dealt with, to a certain extent, by the use of statistical filters (see subsection
“Hybrid approaches”).
A second family of approaches borrows the weighting schemes used for term indexing (Salton and Buckley (1988)) and makes use of numerical (statistical)
information in order to discover the topics. For each couple of words in the text, a
statistical measure is calculated. This allows quantifying the dependency between the two
words in the binary collocation, also called bigram. A well-known and widely used such measure is defined as:

$$IM(x, y) = \frac{P(x, y)}{P(x)\,P(y)}$$
where P (x) and P (y) are the probabilities that the word x and, respectively, y appear in
the text, while P (x, y) represents the probability of the words x and y appearing together
as neighbours. This allows us to calculate the dependence between two words that are adjacent or that co-occur within a window
around a word (5 words before + word + 5 words after). Once we have the tool for
extracting bigrams from the text, some authors (EXIT Roche (2004), ESATEC Biskri et
al. (2004)) propose ways of constructing n-grams, by iteratively combining the bigrams or by
adding another word to an existing (n-1)-gram, trying to obtain longer, more descriptive collocations.
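A minimal sketch of this bigram scoring, computed on adjacent word pairs of a toy corpus, is given below; the corpus and the minimum-frequency filter are illustrative assumptions.

```python
# Scoring adjacent word pairs with IM(x, y) = P(x, y) / (P(x) P(y)).
# The toy corpus and the minimum-frequency filter are illustrative choices.
from collections import Counter

corpus = ("data mining is the analysis of data . data mining extracts patterns . "
          "text analysis is related to data mining").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def im(x, y):
    p_xy = bigrams[(x, y)] / (N - 1)         # probability of the two words being neighbours
    return p_xy / ((unigrams[x] / N) * (unigrams[y] / N))

scored = {b: im(*b) for b, f in bigrams.items() if f >= 2}   # keep frequent bigrams only
for (x, y), score in sorted(scored.items(), key=lambda kv: -kv[1]):
    print(f"{x} {y}\t{score:.1f}")           # 'data mining' stands out as a strong bigram
```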
Of course, many statistical measures have been proposed to calculate the strength of
the relationship between two words. In Anaya-Sánchez et al. (2008) the algorithm first
finds a set of terms that are frequent (over a minimum threshold). Then a set of pairs of
these terms is created, retaining only the ones that reach a minimum frequency. Only for
these pairs, the β-similarity is calculated and the set of documents for which the pair is
representative is constructed. In Silva, Dias, Guilloré, and Pereira (1999), Dias, Guilloré,
and Lopes (2000) , the authors consider that a special “glue” exists between words that
make them have a sense together. The LocalMaxs algorithm is used in conjunction with the Symmetric
Conditional Probability (SCP) measure and with the Mutual Expectation (ME) measure for extracting Non-Continuous
Multiple Word Units. Nikos, Fakotakis, and Kokkinakis (2002) start from the idea
that all n-grams can be constructed from bigrams and that it is, therefore, essential to
extract the bigrams as accurately as possible. Experiments are performed on some of the best known and most used measures,
judging them by their ability to identify lexically associated bigrams.
There are other approaches that do not rely on bigram detection and n-gram
construction. In Hammouda et al. (2005), the authors consider that the keyphrases that describe a topic
naturally lie at the intersection of the documents of the cluster. The CorePhrase algorithm
works by building a phrase indexing graph structure, known as the Document Index Graph (DIG). It keeps
a cumulative graph of the documents processed so far; for each new
document, its subgraph is matched with the existing cumulative graph to extract the
matching phrases between the new document and all previous documents. The graph
maintains complete phrase structure identifying the containing document and phrase location.
In LINGO Osinski (2003), a Suffix Array (Manber and Myers (1990)) based
keyphrase discovery is used. The algorithm tries to avoid extracting incomplete phrases
(like “President Nicolas” instead of “President Nicolas Sarkozy”), which are often
meaningless; to do so, it uses the notion of phrase completeness. A phrase is complete if and only if
all its components appear together in all occurrences of the phrase. For example, if the
phrase “President Nicolas” is followed in all occurrences by the term “Sarkozy”, then it is
not a complete phrase. Starting from this definition, right and left completeness can be
defined (the example above is left complete, but not right complete). Using a Suffix Array
data structure, the complete phrases can be detected and the ones that occur a minimum
number of times (frequent keyphrases) create the candidate set for the topics. A more
detailed explanation of this approach is presented in section “Combining the two phases
into an integrated system for extracting topics”.
Hybrid approaches
Purely statistical methods are known to produce noisy results in the field of Information Retrieval (Biskri et al. (2004)), meaning
that among the extracted candidates, most of them pass the frequency threshold and get
good scores, but they are uninteresting from the topics point of view. Such expressions
can be comprised of common words (articles, prepositions, certain verbs, etc) like “he
responded that” or “the biggest part of the”, and they bring no new information. Such
phrases should be eliminated. For that, linguistic filters are very useful.
Some of the linguistic methods rely on certain keyphrase formats (like patterns built around a <Subject>), and a POS
tagger could be used as a final phase to filter out the noise from the candidate set
resulting from statistical extraction. The XTRACT system
(Smadja (1991)) benefits from such a filter; it is comprised of three phases. In the first phase, bigrams are
extracted from a grammatically tagged corpus, using an eleven-word window. The next
phase consists in extracting longer phrases if they are frequent in the text. These phrases
are called rigid noun phrases. The third phase is the linguistic phase. It consists in
filtering the bigrams with syntactic patterns (ex: <Adjective> <Noun>) and afterwards, for each bigram, associating together the longer phrases in which it appears.
Combining the two phases into an integrated system for extracting topics
The literature provides many examples of systems that can extract topics from texts
(see previous sections). In order to exemplify to the reader how a topic extraction system
can be constructed by use of textual clustering and keyphrase extraction, in this section
we will present an approach proposed in (Rizoiu, Velcin, and Chauchat (2010)). We will
follow phase by phase the chain of processing that starts with a collection of texts (on-line
discussions, forums, chats, newspaper articles, etc.) and presents at the output, on one hand,
the topics extracted from the collection, under the form of readable expressions, and, on
the other hand, the groups of documents associated with each topic.
In the first phase, each of the documents in the data set is pretreated (see subsection
“Pretreatment” ) in order to eliminate words that do not bring any information about the
thematic of the text, thus do not help in extracting the topics. At the same time different
inflected forms are brought to their stem in order to increase their descriptive value. After
this phase of pretreatment, the documents are translated into the Vector Space Model
(subsection “Vector Space Model”) using one of the term weighting schemes, in order to
make them usable by the clustering algorithm.
Next starts the clustering process, using the OKM algorithm (see subsection
“Textual Clustering”). Some of the reasons why Rizoiu et al. (2010) have chosen this
algorithm will be discussed in the following subsections. With the documents now
regrouped, we return to the original dataset in order to extract the complete frequent
keyphrases, using a Suffix Array based algorithm. The procedure will be detailed in
subsection “Keyphrase Extraction. Name candidates”. The extracted phrases will be the
candidates for the each topic’s name. In the final phase (detailed in subsection
“Associating names to clusters”), the best candidates are chosen to represent the topics and label the clusters.
Pretreatment
The quality of the extracted topics is dependent on the quality of the input dataset and on the pretreatment process. Its
purpose is to augment the descriptive power of the words, limit the size of the vocabulary
and eliminate certain words that are known to bring no useful information. It is
composed of two operations: stemming and stopword removal.
Stemming is the process through which inflections, prefixes and suffixes are removed
from each term in the collection. It is extremely useful especially for languages that are
heavily inflected (like the verbs in French) and reduces words to their stems. This
guarantees that all inflected forms of a term are treated as one single term, which
increases their descriptive power. At the same time, bare stems may be more difficult to
understand for the users, but since the stemmed versions of the terms are never
presented to the user, this will not hinder their usage. Stemming is dependent on the language,
but algorithms have been developed for most of the widely used languages. For English,
the most used is Porter’s stemmer (Porter (1980)), while stemmers for most other European languages are also available.
Stopwords (articles, prepositions, etc.) do not present any descriptive value, as they
appear in the texts independently of the thematic, so they are of no use for the clustering
process. Even more, they only make the corpus dictionary bigger, so that computation is
slower. Some term weighting schemes (such as Term Frequency) are especially vulnerable
to stopwords, so their elimination is compulsory. This is done using stopword lists for
each language.
Stemmed words are hard to read, while stopwords actually improve the overall quality of the cluster
names for a human reader. Therefore keyphrase discovery requires the texts to be in their
natural form, so a non-treated version of the documents is also kept to be used for that
phase. Pretreatment is the only part of the algorithm that is language dependent. Adding
support for new languages is as easy as adding new stemming algorithms and
stopwords lists.
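A minimal sketch of the pretreatment phase is given below, assuming NLTK’s Porter stemmer and a deliberately tiny stopword list; real systems use full per-language stopword lists.

```python
# Sketch of the pretreatment phase: stopword removal followed by stemming.
# The stopword list is deliberately tiny here; a full per-language list would be used in practice.
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}
stemmer = PorterStemmer()

def pretreat(text):
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

raw = "The companies are buying oil and natural gas properties"
print(pretreat(raw))
# ['compani', 'buy', 'oil', 'natur', 'ga', 'properti']  (stems are never shown to the user)
```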
Vector Space Model
Most of the clustering algorithms presented were not designed specifically for texts. That is why they require the text to be
transformed into an internal format before they can be used. This problem has been
addressed extensively in the Information Retrieval field and various models have been
proposed: the Boolean Model compares a True/False query statement with the word set that
describes a document, while the Probabilistic Model calculates the relevance probabilities for the
documents in the set. But the model that is most widely used in modern clustering
algorithms is the Vector Space Model (Salton, Wong, and Yang (1975)), in which each document is represented as a vector of term weights,
proportional to the degree of relationship between them. There are four major ways of
computing these weights. The simplest is the binary weighting, whose formula is:
$$a_{i,j} = \begin{cases} 1 & \text{if term } i \text{ is found in document } j \\ 0 & \text{otherwise} \end{cases}$$
In Osinski (2003) it is shown that this scheme can only tell whether a word is related to a
document, not how strongly. The Term Frequency (TF) scheme instead counts how many times a term
appears in a document. While this is a better measure of the relationship between the
term (word) and the document, this scheme has the tendency of favoring longer documents,
which is why the count is usually normalized:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the considered term ($t_i$) in document $d_j$, and the
denominator is the total number of term occurrences in $d_j$. The Inverse Document Frequency (IDF) quantifies the importance of a
term in the whole corpus. It expresses the idea that a word should be less important if it
appears in many documents. In this way very common words, as prepositions, articles and
certain verbs and adjectives could be filtered out or, at least, given less importance.
$$IDF_i = \log \frac{|D|}{|\{d \mid t_i \in d\}|}$$
where |D| is the total number of documents in the collection and | {d | ti ∈ d} | is the total
number of documents where the term ti appears. In practice, IDF is never used alone, as
it lacks the power to quantify the relationship between a word and a document. It also
favours the very rare words, which are most of the time just noise. Instead, IDF is used
in combination with the Term Frequency, in the TFxIDF scheme: $TFIDF_{i,j} = TF_{i,j} \times IDF_i$.
This scheme aims at balancing local and global occurrences. A high weight in TFxIDF is
reached by a high term frequency (in the given document) and a low frequency of the
term in the whole collection of documents. This weighting scheme hence tends to filter out
common terms.
Once the documents have been translated into the Vector Space Model, using one of the
schemes presented above, the similarity between two documents is usually calculated using
$$\mathrm{similarity}(a, b) = \cos(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{t} a_i b_i}{\sqrt{\sum_{i=1}^{t} a_i^2}\,\sqrt{\sum_{i=1}^{t} b_i^2}} \qquad (1)$$
which can be interpreted as the cosine of the geometrical angle between the two vectors in the
multidimensional space.
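The following sketch puts the weighting and similarity formulas together on three toy documents; the documents themselves are illustrative.

```python
# TFxIDF weighting and cosine similarity, following the formulas above.
# The three toy documents are illustrative.
import math
from collections import Counter

docs = [["oil", "gas", "company"],
        ["gas", "company", "profit", "oil"],
        ["cocoa", "buffer", "stock"]]
vocab = sorted({t for d in docs for t in d})
D = len(docs)

def tfidf(doc):
    counts = Counter(doc)
    vec = []
    for t in vocab:
        tf = counts[t] / len(doc)                              # normalized term frequency
        df = sum(1 for d in docs if t in d)                    # document frequency of t
        vec.append(tf * math.log(D / df) if df else 0.0)       # TFxIDF weight
    return vec

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

vectors = [tfidf(d) for d in docs]
print(round(cosine(vectors[0], vectors[1]), 3))   # related documents: similarity > 0
print(round(cosine(vectors[0], vectors[2]), 3))   # unrelated documents: similarity = 0
```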
Clustering
Having the documents pre-treated and translated into the Vector Space Model
using one of the four measures presented in subsection “Vector Space Model”, the actual
clustering can be performed. We have already underlined the importance of the polysemy of words and the need of term disambiguation for
Ontology Learning. We have shown that one solution for addressing this problem would
be the usage of an overlapping clustering solution, that would allow documents to be part
of more than one group. Therefore, the topic extraction system presented in Rizoiu et al.
(2010) regroups the text documents using the OKM algorithm (presented in subsection “Textual Clustering”).
The OKM implementation used by the authors of Rizoiu et al. (2010) respects the
original indications of Cleuziou (2007). The only change made was to the stopping
condition. In the original form, the iteration process comes to an end when the partition
composition does not change any more - which means that a local minimum has been
reached. While from the clustering’s point of view the final result has been found, it does
not necessarily mean that centroids do not evolve over the next iterations.
In the classical KMeans, each centroid depends only on the documents of its own cluster,
meaning that if the clusters do not change between two iterations, neither do the centroids.
In OKM the centroid update process is a little more complicated. In the document -
cluster assignment phase, OKM does not seek to minimize the variance between a
document and its centroid. It rather constructs an image of the centroids to which the
document is associated, in such a way that the distance between the document and its image is minimal.
Therefore, in the cluster update phase, the centroids depend not only on the documents
in their own group, but also on the other centroids resulting from the last iteration. The
update formula for component $v$ of centroid $c_j$ is:
$$c_{j,v} = \frac{1}{\sum_{x_i \in R_j} \frac{1}{\delta_i^2}} \sum_{x_i \in R_j} \frac{1}{\delta_i^2}\, \hat{x}^{j}_{i,v} \qquad (2)$$

where $\hat{x}^{j}_{i,v} = \delta_i x_{i,v} - (\delta_i - 1)\, \bar{x}^{A \setminus \{c_j\}}_{i,v}$, with $\delta_i$ the number of clusters to which document $x_i$ is assigned and $\bar{x}^{A \setminus \{c_j\}}_{i,v}$ the mean, on component $v$, of the centroids of these clusters except centroid $c_j$.
This dependency means that centroids continue to change even if the clusters
composition does not. In the process of general purpose clustering, the centroid is a
by-product and the partition is the main result. But since the process of topic name
association relies directly on the centroids, it is very important to have the centroids computed as exactly as possible. That is why the
iteration process should not stop when the clusters stop changing, but rather use a
threshold ε. In this manner, the clustering process ends only when the variation of the
centroids between two consecutive iterations drops below ε.
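A possible implementation of this modified stopping condition is sketched below; the value of ε is an arbitrary illustrative choice.

```python
# Sketch of the modified stopping condition: iterate until the centroids themselves
# stabilize, i.e. until their total displacement between two iterations drops below epsilon.
import numpy as np

def centroids_converged(old_centroids, new_centroids, epsilon=1e-4):
    shift = sum(np.linalg.norm(n - o) for o, n in zip(old_centroids, new_centroids))
    return shift < epsilon

print(centroids_converged([np.array([0.0, 0.0])], [np.array([0.0, 1e-5])]))  # True

# inside the clustering loop (pseudo-usage):
#   while not centroids_converged(previous, current, epsilon):
#       previous = current
#       ...assign documents and recompute centroids...
```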
Keyphrase Extraction. Name candidates
The next processing phase of the topic extraction system proposed in Rizoiu et al.
(2010) employs a keyphrase extraction algorithm in order to build a topic name candidate
set. Osinski (2003) presents the conditions that a collocation (or a term) must respect in
order to become a topic name candidate:
• appear in the text with a specified frequency. This is based on the assumption
that the keyphrases that occur often in the text have the strongest descriptive power.
Also, isolated appearances have high chances of being incorrect words (Roche (2004)).
• be a complete phrase. Complete phrases make more sense than incomplete ones.
• not begin or end with a stopword. Cluster name candidates will be stripped of leading and trailing stopwords.
Both LINGO (Osinski (2003)) and the system presented in Rizoiu et al. (2010)
chose the Suffix Array based (Manber and Myers (1990)) approach for the keyphrase
extraction task. They motivated their choice by the approach’s ability to extract the
phrases from untreated text, its language independence, linear execution time and the
power to extract humanly readable phrases. Also, both systems were designed to extract
topics from texts found on the Internet, which requires a great flexibility and stability
towards different writing styles - which can vary from informal discussions to scientific
articles - and different languages. These two characteristics make Web Mining particularly
difficult for the linguistic approaches presented in subsection “Keyphrase Extraction”.
The Suffix Array based extraction makes use of the property of completeness (defined in subsection “Keyphrase Extraction”) and
works in two steps: in the first step left and right complete expressions are found. In the
second step the two sets are intersected to obtain the set of complete expressions.
The discovery of these expressions relies on the usage of a Suffix Array. A Suffix Array is an alphabetically ordered array of all
the suffixes of a string. We note here that in our case, the fundamental unit is not the letter
(as in the case of classical strings), but the term / word. For example, having the phrase
“we are having a reunion”, the Suffix Array for it would be constructed as shown in
Table 2. One of the most important problems in the construction of the Suffix Array is
the space- and time-efficient sorting of the suffixes. In Larsson (1998), two approaches
are presented: “Manber and Myers” and “Sadakane’s algorithm”. The paper also makes a
comparison of the two, from both the theoretical and practical performance point of view.
According to the test results of Larsson (1998), the second approach gives better results in
terms of efficiency.
The only thing required for the algorithm is that the terms have a lexicographic
order, so they can be compared. While in the example in Table 2 we have used the
alphabetical order for the sake of clarity, in a real-case implementation the actual criterion is not
important. The order of term arrival into the collection can also be used. The “Sadakane’s
sorting algorithm” is a modified bucket sorting, which takes into consideration the unequal
dimensions of the suffixes. In Larsson (1998), it is shown that the sorting complexity is
O(n log n), with n the number of suffixes. Considering that a keyphrase cannot pass the
boundary of a sentence, the implementation in Rizoiu et al. (2010) differs from that proposed
in Osinski (2003) by constructing the Suffix Array on a sentence basis, rather
than on the whole document as in the latter. Therefore a suffix is
identified not only by its starting position, but also by the index of its sentence.
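The construction of a word-level Suffix Array can be sketched in a few lines; the following reproduces the content of Table 2.

```python
# Word-level Suffix Array for the phrase of Table 2, sorted lexicographically.
words = "we are having a reunion".split()
suffixes = sorted(range(len(words)), key=lambda i: words[i:])   # store starting positions only
for rank, start in enumerate(suffixes, 1):
    print(rank, " ".join(words[start:]), start + 1)             # 1-based position, as in Table 2
```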
Complete Phrase Discovery. The general idea behind the right complete keyphrase
discovery algorithm is to linearly scan the suffix array in search for frequent prefixes,
counting their occurrences along the way. Once such a prefix is identified, information about
its position and frequency (initially, the frequency is 2) is stored along with it.
Once the right complete phrases have been discovered, we also need to discover the
left complete phrases. This can be achieved by applying the same algorithm as before to
the inverse of the document - meaning that another version of the document is created,
having the words in reverse order. While the algorithm finds the right complete phrases in
lexicographic order, the left complete set needs another inversion to recover the right
order.
With both sets in lexicographic order, they can be intersected in linear time. Name
candidates are returned along with their frequency. We must note here that the extracted
candidates can also be single terms, as sometimes a single word can be enough for describing a topic.
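The whole discovery process can be illustrated on a toy text; the sketch below computes phrase frequencies naively instead of scanning a Suffix Array, but it follows the same two-step logic (right-complete phrases on the original word order, left-complete phrases on the reversed one, then the intersection). The text and the frequency threshold are illustrative.

```python
# Sketch of complete-phrase discovery: right-complete phrases on the original text,
# left-complete phrases on the reversed text, then the intersection of the two sets.
# Frequencies are computed naively here for clarity; the toy text is illustrative.
from collections import Counter

def frequent_phrases(tokens, min_freq=2, max_len=4):
    counts = Counter(tuple(tokens[i:i + n])
                     for n in range(1, max_len + 1)
                     for i in range(len(tokens) - n + 1))
    return {p: f for p, f in counts.items() if f >= min_freq}

def right_complete(tokens, phrases):
    """A phrase is right-complete if it is not always followed by the same word."""
    complete = set()
    for p in phrases:
        followers = {tuple(tokens[i + len(p):i + len(p) + 1])
                     for i in range(len(tokens) - len(p) + 1)
                     if tuple(tokens[i:i + len(p)]) == p}
        if len(followers) > 1:
            complete.add(p)
    return complete

text = ("president nicolas sarkozy spoke . president nicolas sarkozy left . "
        "the president arrived").split()
right = right_complete(text, frequent_phrases(text))
rev = list(reversed(text))
left = {tuple(reversed(p)) for p in right_complete(rev, frequent_phrases(rev))}

# '.' and other stopword-like candidates are removed by the later filtering step.
print(sorted(" ".join(p) for p in (right & left)))
# ['.', 'president', 'president nicolas sarkozy']  -- but not 'president nicolas'
```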
The last phase is filtering the candidate set. First, only phrases that appear in the
texts with minimum frequency are kept, the rest are eliminated. In Osinski (2003), the
value of this threshold is suggested to be between 2 and 5. The relatively low value
can be explained by the fact that the most frequent expressions are not necessarily the
most expressive; usually they are meaningless expressions - noise in the output.
The second filter enforces the last condition enumerated at the beginning of this subsection: not to begin or to end with a stopword.
Using the same methods as in the pretreatment phase, leading and trailing stopwords are
recursively eliminated from the phrases. As a result, some of the candidates disappear
completely (they were composed only of stopwords), while others are reduced to
another form (for example, “the president” and “president of” both become “president”).
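The stopword stripping of the candidates can be sketched as follows; the stopword list is a tiny illustrative sample.

```python
# Stripping leading and trailing stopwords from the name candidates.
# The stopword list is a tiny illustrative sample.
STOPWORDS = {"the", "of", "and", "a", "to"}

def strip_stopwords(phrase):
    words = phrase.split()
    while words and words[0] in STOPWORDS:     # remove leading stopwords
        words = words[1:]
    while words and words[-1] in STOPWORDS:    # remove trailing stopwords
        words = words[:-1]
    return " ".join(words)                     # internal stopwords are kept

candidates = ["the president", "president of", "of the", "oil and gas company"]
print({c: strip_stopwords(c) for c in candidates})
# {'the president': 'president', 'president of': 'president',
#  'of the': '', 'oil and gas company': 'oil and gas company'}
```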
Associating names to clusters
The clustering phase outputs a data partition that regroups documents according to
their thematic similarity. At the same time, this phase outputs the centres of the classes,
also called centroids, which can be regarded as abstract representations of the topics.
These centres are documents in the Vector Space Model, having high weight for the terms
that are specific for the group, which means words that are characteristic for the topic.
On the other hand, the keyphrase extraction phase generates a list of name
candidates for the topics. In this last phase, a suitable name is chosen for each centroid in
order to label the topics. This ‘centroid - name’ association is done by taking all the name
candidates and reintroducing them into the Vector Space Model document collection as
“pseudo-documents”. These candidates first undergo the same pretreatment as the original documents:
because the keyphrases were extracted from natural language texts, they contain inflected
words and stopwords. Afterwards, they are translated into the Vector Space Model, using
the same term weighting scheme as for the original documents of the collection. The last
step is to calculate the similarity between each of these “pseudo-documents” and the
centroid of the class. The one that scores highest is chosen to be the cluster name.
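A minimal sketch of this association step is given below; the vocabulary, centroid weights and candidate phrases are invented for the example.

```python
# Sketch of the 'centroid - name' association: each candidate phrase is treated as a
# pseudo-document, projected into the same Vector Space Model, and compared to the
# centroid with the cosine similarity; the highest-scoring candidate names the cluster.
# Vocabulary, centroid and candidates below are illustrative.
import numpy as np

vocab = ["presid", "elect", "polit", "econom", "respond"]
centroid = np.array([0.8, 0.7, 0.6, 0.1, 0.0])          # high weights on the cluster's key stems

candidates = {"presidential elections": ["presid", "elect"],   # already stemmed pseudo-documents
              "he responded that":      ["respond"],
              "economic policy":        ["econom", "polit"]}

def to_vector(stems):
    return np.array([stems.count(t) for t in vocab], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {name: cosine(to_vector(stems), centroid) for name, stems in candidates.items()}
print(max(scores, key=scores.get), scores)    # 'presidential elections' wins
```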
This phase filters out the noise from the keyphrase candidate set. While centroids
are the essence of the documents in those classes, choosing the candidates that are closest
to them naturally eliminates phrases that are too general. For example, in a document
group that talks mainly about politics, the most important terms (measured with a term
weighting scheme) would be words like “election”, “president”,
“politics”, etc. When calculating the similarity (cosine similarity) between this centroid
and the phrase candidates, it is natural that a candidate that contains as many of those words as possible
would be favoured. The phrase “presidential elections” would clearly score higher than
a general phrase like “the biggest part of the”.
This candidate pruning side-effect resembles the hybrid approaches presented in the
section “Keyphrase Extraction”, without the actual linguistic filter. The term extraction task of the
lower layers of the Ontology Learning Layer Cake could surely benefit from such a linguistic-free approach, which is more suitable for Web Mining.
In this subsection we will briefly present some experiments and results that can be
obtained with this system. The English dataset used in this tests is a sub-partition of the
Reuters 2 corpus, composed of 262 documents. The writing style is that of journal articles,
each containing between 21 and 1000 words. The authors also used in their experimentations
texts from French forums, to test the performance of their system on languages other than English.
Experiments were performed to test both the clustering phase and the keyphrase
extraction phase. The OKM clustering algorithm has been evaluated in
Cleuziou (2007), Cleuziou (2009) and Rizoiu et al. (2010). The authors’ main approach
towards evaluating the quality of the resulted partition is to use the classical precision,
recall and FMeasure indicators on a corpus that has been tagged a priori by human
experts. They used a sub-collection of the Reuters corpus that had at least one tag
associated. Using this method of evaluation, the authors concluded that the overlapping
approach indeed out-performs the classical crisp algorithms when being used for text
clustering.
The evaluation of the topic names extracted with the Suffix Array approach is done
in Osinski (2003) and Rizoiu et al. (2010). Here, the authors have used an expert based
evaluation of cluster names, arguing that there are no widely accepted automatic topic
quality measures (see next subsection). Since topic names need to be humanly-readable
and they need to synthesize the thematic of a group of texts, evaluating them is like
trying to evaluate “human tastes”. The experiments showed a rather good acceptance by
the users of the extracted topics, especially when using the Term Frequency and the
TFxIDF weighting schemes.
Table 1 presents an output example of extracted topics. The algorithm was run on
the dataset presented at the beginning of this section. It was asked to extract four
topics. The first column shows the extracted topics: “cocoa buffer stock”, “oil and gas
company”, “tonnes of copper” and “united food and commercial workers”. The second
column presents, for each topic, the ten words/terms in the texts that have achieved the
highest scores. These words are the most important part of the centroid of each class. As
they are output directly by the clustering algorithm, they appear in their stemmed version.
The next two columns present the number of documents covered and three examples of
documents that are part of the clusters of each topic. Let’s take as an example the most
important topic of the dataset, the one that covers the most texts: “oil and gas company”.
The first two examples talk directly about the economic activities of companies that
operate in the business of oil and natural gas (buying oil and natural gas properties in the
first case and estimating reserves in the second case). On the other hand, the third
document talks about the food-for-oil program between Brazil and Iraq. Although the
text does not refer to an oil company, as in the first two cases, the document is still placed
under this topic. This is because it still touches the thematic of “oil and gas”, just as it touches
the thematic of food. That is why this document is also found under the topic “united food and commercial workers”.
We can see from this example how important the overlapping property of the
clustering algorithm is. With a crisp approach, this document would have been placed under
only one topic, when in fact it talks about two topics. Still, the extracted topics are too
specific. Topics like “oil and gas” and “food” would have been more appropriate.
Conclusions and Perspectives
Over the last ten years, several approaches have been proposed in order to regroup
textual datasets into homogeneous clusters and, moreover, to label these clusters with
topic names. Among these various approaches, some models are able to deal with the
overlapping issue. That is the crucial point because it allows texts to be related to more
than one unique topic. Here stands an important dichotomy between a “fuzzy” approach
(each text is covered more or less by the topics) and a “crisp” approach (each text is
exactly covered by one to several topics). Until now, the literature does not present a
rigorous comparison between different approaches for topic extraction - such as LDA, LSI,
BKM, OKM, etc. - in terms of assessment of topic names. The main reason is probably
that the comparison criterion is difficult to set, which is highly linked to the crucial
issue of evaluating topic quality.
In this chapter, we present different approaches to extract useful topic names from
texts. Even if some works try to avoid such an additional step (X. Wang et al. (2007)),
most models still describe a topic only through its distribution over words. The extracted phrases are often more intelligible than series of
single words. They may be a key to fill the gap between topics and concepts (the topic
being an intermediary step on the way towards the concept).
The “State of the Art” section must be read from two points of view. On the one
hand, it provides the ingredients for the topic extraction system presented in the second
part of the chapter. But on the other hand, all these algorithms can be used at the
different layers of the Ontology Learning Layer Cake. The keyphrase extraction algorithms
can be used for the term extraction at the Term Layer, while the clustering techniques can
be employed at the synonym layer. Here the overlapping issue seems important in the
disambiguation task. Allowing terms to be regrouped in more than one cluster means, in
fact, letting the different meanings of a term be put together with their synonyms.
The chapter ends with the presentation of a whole integrated system. This system
addresses the problem of topic extraction from textual data. The texts we are interested
in present some rather challenging particularities, like being multilingual, having very
different writing styles and purposes (from informal chats to academic microbiology
articles). The main advantage of this system is that it allows overlapping between the
clusters of texts, so that a text could be defined by more than one topic, which is an
essential property for the disambiguation task.
For the concept learning, this system allows an extraction of terms and phrases that
is robust to noise. By translating the keyphrases - or any other list of
extracted terms - into pseudo-documents and injecting them back into the Vector Space
Model, the terms can be pruned, actually obtaining a less noisy list of terms. This is
similar to adding linguistic filters to statistic methods, but without their language and
field dependency.
With the problem of topic extraction partly solved, there still remains the most strategic
issue: filling the gap between topics and concepts. This is highly related to other works, such as
Hierarchical Agglomerative Clustering and the concept hierarchies used in Formal Concept
Analysis. The recent work of Estruch, Orallo, and Quintana (2008) proposes in this line
an original framework to fill the gap between statistics and logic. Part of the solution is to
make contributions relative to the assessment of topic quality. Other works are precisely
directed towards such issues (Boyd-Graber, Chang, Gerrish, Wang, and Blei (2009)). The
key question now is then to evaluate the necessity and level of human-expert intervention.
Two other important perspectives are related to the question of granularity. The
“horizontal” granularity deals with building hierarchies of topics: each level of the
hierarchy presents topics which are more general than the topics of the level below.
Recently, several works have tried to address this issue. For instance, D. Blei et al. (2004) and C.
Wang and Blei (2009) build topic hierarchies based on the nested Chinese restaurant
process. Such topic hierarchies seem to be more adapted than “flat” topics in the task of
concept construction. At the same time, it brings topics closer and closer to concepts, as
these hierarchies provide a relation of taxonomy. The “vertical” granularity deals with the
evolution of topics through time. Several probabilistic models have recently been proposed
(D. Blei and Lafferty (2006); C. Wang, Blei, and Heckerman (2008)). It would be of high
interest to relate such dynamic models to other work in concept learning, such as those
presented by Chen, Wang, and Zhou (2009). This kind of work will certainly help to bring topics and concepts closer together.
References
clustering algorithm for topic discovering and labeling. In Ciarp ’08: Proceedings of
Heidelberg: Springer-Verlag.
Berry, M. W., Dumais, S., O’Brien, G., Berry, M. W., Dumais, S. T., & Gavin. (1995).
Using linear algebra for intelligent information retrieval. SIAM Review, 37, 573–595.
Biskri, I., Meunier, J. G., & Joyal, S. (2004). L’extraction des termes complexes : une
belgique), gérard purnelle, cédrick fairon & anne dister (eds). presses universitaires
Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and
Blei, D. M., Ng, A. Y., Jordan, M. I., & Lafferty, J. (2003). Latent dirichlet allocation.
Boyd-Graber, J., Chang, J., Gerrish, S., Wang, C., & Blei, D. (2009). Reading Tea Leaves:
Buitelaar, P., Cimiano, P., & Magnini, B. (2005). Ontology learning from texts: An
Chen, S., Wang, H., & Zhou, S. (2009). Concept clustering of evolving data. In Icde (pp.
Cimiano, P., Völker, J., & Studer, R. (2006). Ontologies on demand? -a description of the
state-of-the-art, applications, challenges and trends for ontology learning from text
Cleuziou, G. (2007). Okm : une extension des k-moyennes pour la recherche de classes
31-42). Cépaduès-Éditions.
Dias, G., Guilloré, S., & Lopes, J. G. P. (2000, 22-24 March). Extraction automatique
de Lausanne.
Dunn, J. C. (1973). A fuzzy relative of the isodata process and its use in detecting
Estruch, V., Orallo, J., & Quintana, M. (2008). Bridging the gap between distance and
Geraci, F., Pellegrini, M., Maggini, M., & Sebastiani, F. (2006). Cluster generation and
cluster labelling for web snippets: A fast and accurate hierarchical solution. In (pp.
25–36).
Godoy, D., & Amandi, A. (2006). Modeling user interests by conceptual clustering.
Kietz, J., Maedche, A., & Volz, R. (2000). A method for semi-automatic ontology
LUNDFD6/(NFCS-3130)/1–43/(1998)).
how to tell a pine cone from an ice cream cone. In Sigdoc ’86: Proceedings of the 5th
Manber, U., & Myers, G. (1990). Suffix arrays: A new method for on-line string searches.
Michalsky, R., & Stepp, R. (1983). Learning from observation: conceptual clustering, in:
Osinski, S. (2003). An algorithm for clustering of web search results. Unpublished master’s
Parsons, L., Haque, E., & Liu, H. (2004). Subspace clustering for high dimensional data: a
Rizoiu, M.-A., Velcin, J., & Chauchat, J.-H. (2010, janvier). Regrouper les donnèes
Rodríguez, C. (2005). The ABC of Model Selection: AIC, BIC and the New CIC. Bayesian
Inference and Maximum Entropy Methods in Science and Engineering, 803, 80–87.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval.
Salton, G., Wong, A., & Yang, C. S. (1975, November). A vector space model for
Silva, J. da, Dias, G., Guilloré, S., & Pereira. (1999). Using localmaxs algorithm for the
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering
techniques.
Turcato, D., Popowich, F., Toole, J., Fass, D., Nicholson, D., & Tisher, G. (2000).
Velcin, J., & Ganascia, J.-G. (2007). Topic extraction with agape. In Adma (p. 377-388).
Wang, C., & Blei, D. (2009). Variational Inference for the Nested Chinese Restaurant
Process. In Nips.
Wang, C., Blei, D., & Heckerman, D. (2008). Continuous time dynamic topic models.
Wang, X., McCallum, A., & Wei, X. (2007). Topical n-grams: Phrase and topic discovery,
Notes
1 http://www.clef-campaign.org/
2 http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
Table 1
Table 2
Suffix Array construction for the phrase “we are having a reunion”
Index  Suffix                      Position
1      a reunion                   4
2      are having a reunion        2
3      having a reunion            3
4      reunion                     5
5      we are having a reunion     1
Figure Captions