Telstem: An Unsupervised Telugu Stemmer with Heuristic Improvements and Normalized Signatures


CHAPTER 4

TELSTEM: AN UNSUPERVISED TELUGU STEMMER WITH HEURISTIC IMPROVEMENTS AND NORMALIZED SIGNATURES

4.1 CONCEPT OF STEMMING

Linguistically, words follow the morphological rules that allow a speaker to

derive variants of a same idea to evoke an action (verb), an object or concept (noun)

or the property of something (adjective). For instance, the following words are

derived from the same stem and share an abstract meaning of action and movement:

activate activating active activeness

activated activation actively actives

activates activations activeness etc.

Stemming deduces the stem from a fully suffixed word according to its

morphological rules. These rules concern derivational and inflectional suffixes. The former usually changes the lexical category of a word, whereas the latter indicates plural and gender (in gender-oriented languages such as French, Spanish and German):

Derivational suffix: activate (verb) → activation (noun)

Inflectional suffix: activation (noun) → activations (plural noun)

Stemming is a technique to transform different inflections and derivations of

the same word to one common "stem". Stemming can mean both prefix and suffix

removal. Stemming can, for example, be used to ensure that the greatest number of

relevant matches is included in search results. A word's stem is its most basic form:

for example, the stem of a plural noun is the singular; the stem of a past-tense verb is

the present tense. The stem is, however, not to be confused with a word's lemma: the stem does not have to be an actual word itself. Instead, the stem can be said to be the least common denominator of the morphological variants.

4.1.1 COMPARISON OF STEMMING AND LEMMATIZATION

The motivation for using stemming instead of lemmatization, or indeed tagging of the text, is mainly a question of cost. It is considerably more expensive, in

terms of time and effort, to develop a well performing lemmatizer than to develop a

well performing stemmer. It is also more expensive in terms of computational power

and run time to use a lemmatizer than to use a stemmer. The reason for this is that the

stemmer can use ad-hoc suffix and prefix stripping rules and exception lists while the

lemmatizer must do a complete morphological analysis (based on actual grammatical rules and a dictionary). Another point of motivation is that a stemmer

can deliberately “bring together” semantically related words belonging to different

word classes to the same stem, which a lemmatizer cannot.

The goal of both stemming and lemmatization is to reduce inflectional forms

and sometimes derivationally related forms of a word to a common base form.

The two terms stemming and lemmatization differ in flavor. Stemming

usually refers to a crude heuristic process that chops off the ends of words in the hope

of achieving this goal correctly most of the time, and often includes the removal of

derivational affixes. Lemmatization usually refers to doing things properly with the

use of a vocabulary and morphological analysis of words, normally aiming to remove

inflectional endings only and to return the base or dictionary form of a word, which is

known as the lemma. If confronted with the token “saw”, stemming might return just

s, whereas lemmatization would attempt to return either see or saw depending on

whether the use of the token was as a verb or a noun. The two may also differ in that

stemming most commonly collapses derivationally related words, whereas

lemmatization commonly collapses only the different inflectional forms of a lemma.

Linguistic processing for stemming or lemmatization is often done by an additional

plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.

The most common algorithm for stemming English, and the one that has repeatedly been shown to be empirically very effective, is Porter's algorithm. The

entire algorithm is too long and intricate to present here, but we will indicate its

general nature. Porter's algorithm consists of 5 phases of word reductions, applied

sequentially. Within each phase, there are various conventions to select rules, such as

selecting the rule from each rule group that applies to the longest suffix.

Many of the later rules use a concept of the measure of a word, which loosely

checks the number of syllables to see whether a word is long enough that it is

reasonable to regard the matching portion of a rule as a suffix rather than as part of

the stem of a word.

Stemmers use language-specific rules, but they require less knowledge than a

lemmatizer, which needs a complete vocabulary and morphological analysis to

correctly lemmatize words. Particular domains may also require special stemming

rules. However, the exact stemmed form does not matter, only the equivalence classes

it forms.

Rather than using a stemmer, you can use a lemmatizer, a tool from Natural

Language Processing which does full morphological analysis to accurately identify

the lemma for each word. Doing the full morphological analysis produces at most

very modest benefits for retrieval. It is hard to say more, because either form of

normalization tends not to improve English information retrieval performance in

aggregate - at least not by very much. While it helps a lot for some queries, it equally hurts performance on others. Stemming increases recall while harming precision.

As an example of what can go wrong, note that the Porter stemmer stems all of the

following words:

operate operating operates operation operative operatives operational

to "oper". However, since "operate" is a common verb in its various forms, we

would expect to lose considerable precision on queries such as the following with

Porter stemming:

operational and research


operating and system
operative and dentistry

For a case like this, moving to a lemmatizer would not completely fix the problem, because particular inflectional forms are used in particular collocations: a document containing some form of the word "operate" is not thereby a good match for the query "operating and system". Getting better value from term

normalization depends more on pragmatic issues of word use than on formal issues of

linguistic morphology.
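This conflation is easy to check directly (a minimal sketch using NLTK's PorterStemmer; it assumes the nltk package is installed):

```python
# Minimal sketch: reproducing the "operate" conflation with NLTK's Porter
# stemmer (assumes the nltk package is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["operate", "operating", "operates", "operation",
         "operative", "operatives", "operational"]
# All seven forms collapse to the single stem "oper".
print({w: stemmer.stem(w) for w in words})
```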

4.1.2 TYPES OF STEMMING

The ability of an Information Retrieval (IR) system to conflate words allows reducing the index size and enhancing recall, sometimes even without significant deterioration of precision. Conflation also conforms to the user's intuition, because users do not need to worry about the "proper" morphological form of words in a query. Automated conflation is implemented by so-called "stemming" algorithms.

One of the key questions about conflation algorithms is whether automated stemming can be as effective (for IR purposes) as manual processing. Another important and related question is whether a word should be truncated only at the correct root-morpheme boundary, or may be cut at a non-linguistically correct point.

So, although stemming often relies on knowledge of language morphology, its goal is not to find a proper meaningful root of a word. Instead, a word can be truncated at a position "incorrect" from the natural language point of view. For example, Porter's algorithm [73] can make the following productions:

probate -> probat


cease -> ceas

Clearly, the results are not morphologically correct forms of the words.

Nevertheless, since the document index and queries are stemmed "invisibly" for a

user, this particularity should not be considered as a flaw, but rather as a feature

distinguishing stemming from lemmatization.



All stemming algorithms can be roughly classified as affix removing,

statistical and mixed. Affix removal stemmers apply a set of transformation rules to each word, trying to cut off known prefixes or suffixes. The first such algorithm was described by [53].

The major drawback of the affix removal approach is its dependency on a priori knowledge of language morphology. Statistical algorithms try to cope with this problem by finding distributions of root elements in a corpus. Such algorithms started evolving only recently, as increases in computing power made feasible the heavy computations such approaches require.

Mixed algorithms can combine several approaches. For example, an affix

removal algorithm can be enhanced by dictionary lookups for irregular verbs or

exceptional plural/singular forms like "feet/foot".

The variety of stemming algorithms naturally brings up the question of how to compare them. Though explicit measures such as under-stemming (removing too little of a suffix) and over-stemming (removing too much) do exist, they are hard to use due to the lack of a standard testing set. So stemmers are usually compared indirectly, by their effect on search recall. Performance characteristics (speed and storage requirements) are used as well.



Figure 4.1: Types of stemming. Conflation methods are divided into manual and automatic (stemmers); automatic methods comprise affix removal (longest match, simple removal), successor variety, table lookup and n-gram approaches.

TYPE OF STEMMING ALGORITHMS


 Table lookup approach
 Successor Variety
 n-gram stemmers
 Affix Removal Stemmers

TABLE LOOKUP APPROACH

Store a table of all index terms and their stems, so that terms from queries and

indexes could be stemmed very fast.

PROBLEMS
 There is no such data covering all of English, and some terms are domain dependent.

 The storage overhead for such a table can be large, though trading space for time is sometimes warranted.

SUCCESSOR VARIETY APPROACH


 Determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.

 The successor variety of a string is the number of different characters that


follow it in words in some body of text.

 The successor variety of substrings of a term will decrease as more characters


are added until a segment boundary is reached.

Test Word: READABLE

Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,

READING, READS, RED, ROPE, RIPE

Prefix Successor Variety Letters

R 3 E,I,O

RE 2 A,D

REA 1 D

READ 3 A,I,S

READA 1 B

READAB 1 L

READABL 1 E

READABLE 1 (Blank)

 cutoff method

 some cutoff value is selected and a boundary is identified whenever the


cutoff value is reached

 peak and plateau method

 segment break is made after a character whose successor variety


exceeds that of the characters immediately preceding and following it

 complete word method

 a segment break is made when a segment corresponds to a complete word in the corpus

 entropy method

 |D_i|: the number of words in a text body beginning with the length-i sequence of letters

 |D_{ij}|: the number of words in D_i with the successor j

 The probability that a member of D_i has the successor j is |D_{ij}| / |D_i|

 The entropy of D_i is

H_i = -\sum_{j=1}^{26} \frac{|D_{ij}|}{|D_i|} \log_2 \frac{|D_{ij}|}{|D_i|}

 Two criteria used to evaluate various segmentation methods

1. The number of correct segment cuts divided by the total number of


cuts

2. The number of correct segment cuts divided by the total number of the
true boundaries

 After segmenting, if the first segment occurs in more than 12 words in the
corpus, it is probably a prefix.

 The successor variety stemming process has three parts

1. Determine the successor varieties for a word

2. Segment the word using one of the methods

3. Select one of the segments as the stem
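The computation is easy to sketch. The following illustrative Python snippet (variable names and the end-of-word marker are assumptions, not from the original text) reproduces the successor variety counts of the READABLE example and applies the peak-and-plateau rule:

```python
# Minimal sketch: successor varieties for the READABLE example, plus the
# peak-and-plateau segmentation rule (toy corpus from the text above).
corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus):
    """Distinct characters following `prefix`; '' stands for the blank (end of word)."""
    return len({w[len(prefix):len(prefix) + 1] for w in corpus
                if w.startswith(prefix)})

word = "READABLE"
prefixes = [word[:i] for i in range(1, len(word) + 1)]
sv = [successor_variety(p, corpus) for p in prefixes]
print(list(zip(prefixes, sv)))
# [('R', 3), ('RE', 2), ('REA', 1), ('READ', 3), ('READA', 1),
#  ('READAB', 1), ('READABL', 1), ('READABLE', 1)]

# Peak-and-plateau: break after a prefix whose variety exceeds both neighbours.
breaks = [prefixes[i] for i in range(1, len(sv) - 1)
          if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
print(breaks)  # ['READ'] -> boundary after "READ", i.e. READ + ABLE
```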



N-GRAM STEMMERS

 Association measures are calculated between pairs of terms based on shared


unique digrams.

statistics => st ta at ti is st ti ic cs

unique digrams = at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al

unique digrams = al at ca ic is st ta ti

 Dice's coefficient (similarity)

S = 2C / (A + B) = (2 × 6) / (7 + 8) = 0.80

A and B are the numbers of unique digrams in the first and the second word; C is the number of unique digrams shared by the two. (A computational sketch follows this list.)

 Similarity measures are determined for all pairs of terms in the database,
forming a similarity matrix

 Once such a similarity matrix is available, the terms are clustered using a single-link clustering method.
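A minimal computational sketch of the digram extraction and Dice similarity above (helper names are illustrative):

```python
# Minimal sketch: Dice's coefficient over unique digrams, reproducing the
# statistics/statistical example (S = 2*6 / (7+8) = 0.80).
def digrams(term):
    """Set of unique character digrams of a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    A, B = digrams(a), digrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

print(sorted(digrams("statistics")))      # ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(dice("statistics", "statistical"))  # 0.8
```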

AFFIX REMOVAL STEMMERS

 Affix removal algorithms remove suffixes and/or prefixes from terms leaving
a stem

 If a word ends in “ies” but not “eies” or “aies”,

then “ies” -> “y”

 If a word ends in “es” but not “aes”, “ees” or “oes”,

then “es” -> “e”

 If a word ends in “s” but not “us” or “ss”,

then “s” -> “NULL”
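These three rules translate directly into code; the sketch below is an illustrative implementation of just this rule set, not a complete stemmer:

```python
# Minimal sketch: the three "ies"/"es"/"s" rules above, tried in order of
# decreasing suffix length.
def strip_plural(word):
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

print([strip_plural(w) for w in ["ponies", "wishes", "cats", "glass"]])
# ['pony', 'wishe', 'cat', 'glass']
```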

THE PORTER ALGORITHM

 The Porter algorithm consists of a set of condition/action rules.



 The conditions fall into three classes

 Conditions on the stem

 Conditions on the suffix

 Conditions on rules

4.1.3 ERRORS IN STEMMING

Natural languages are not completely regular constructs and therefore

stemmers operating on natural words inevitably make mistakes. On the one hand,

words which ought to be merged together (such as "adhere" and "adhesion") may

remain distinct after stemming; on the other, words which are really distinct may be

wrongly conflated (e.g., "experiment" and "experience"). These are known as

understemming errors and overstemming errors respectively. By counting these errors

for a sample of words, we can gain an insight into the operation of a stemmer, and

compare different stemmers.

To enable the errors to be counted, the words in the collection must already be

organized into 'conceptual groups', containing words (such as "adhere", "adheres",

"adhering", "adhesion", "adhesive") which ought all to be merged to the same stem.

 If the two words belong to the same conceptual group, and are converted to
the same stem, then the conflation is correct; if however they are converted to
different stems, this is counted as an understemming error.

 If the two words belong to different conceptual groups, and remain distinct
after stemming, then the stemmer has behaved correctly. If however they are
converted to the same stem, this is counted as an overstemming error.

4.1.4 BENEFITS OF STEMMING

Takes care of morphological variants. When an index is stemmed, users do

not need to include several forms of a same word in a query: "sale OR sales".

Stemming handles these variations such that the queries "sale" and "sales" will return

the same results.

Reduces index size. Indexed terms are stored in the lexicon. A stemmed index means that the linguistic roots are stored in the lexicon rather than the whole terms.

Each entry in the lexicon contains references to documents in which it appears. When

stemming is used, several terms are conflated as well as information relative to

document references, hence significantly reducing the size of the index.

4.1.5 LIMITATIONS OF STEMMING

It is important to keep in mind that the stemming limitations tend to disappear

with queries including more than one term.

Occasional increased recall. Since more documents are likely to be retrieved

by a single query, the desired documents may be returned with a large number of

other related documents. If a query returns too many documents, narrow it down by

adding more words to the query.

Exceptions to stemming rules. Although stemming has been used since the

early days of computer science, algorithms that were developed for that purpose still

suffer from the same limitation, that is, the conflation of words that are not

semantically related, such as "university" and "universal". This means that a query

containing the term "universal" may retrieve a document that contains "university".

Although exceptions to stemming rules are limited in number, it would be expensive in terms of memory and speed to handle them, given their relative rareness. It is therefore necessary to understand the stemming mechanisms in order to recognize and bypass these situations when they happen; Coveo Enterprise Search's "Optimized for English" and "Multilingual Stemming" both suffer from this limitation.

Impact on advanced syntax and exact match. Advanced syntax and exact

match queries work the same, but the terms in such queries are stemmed. In the case

of advanced syntax, it can lead to unexpected results, if different terms in the query

share the same stem. For instance, "consumers NOT consuming" is stemmed to

"consum- NOT consum-" and leads to no result. Exact match queries can also occasionally retrieve unexpected results, because Coveo Enterprise Search looks for exact matches of stems. For instance, the exact match string "sales reports" matches "sal- report-".

Accented characters are not supported. For performance purposes only,

Coveo Enterprise Search removes accents from characters before stemming.

Consequently, some accented characters in stems lose their distinction in languages

such as French, Spanish and German, although, this very rarely leads to confusion

between stems.

Short words are not stemmed. Words of four or fewer letters are not stemmed. Short words tend to be more error-prone in stemming, especially in English, where care must be taken to recognize short and long syllables. Since this situation affects only a small number of terms, short words are not stemmed, to

favor performance. Therefore, morphological variations of short words will not be

implicitly included in a query.

4.2 RELATED WORK

In most cases, morphological variants of words have similar semantic

interpretations and can be considered as equivalent for the purpose of IR applications.

For this reason, a number of so-called stemming algorithms, or stemmers, have been

developed, which attempt to reduce a word to its stem or root form. Thus, the key

terms of a query or document are represented by stems rather than by the original

words. This not only means that different variants of a term can be conflated to a

single representative form – it also reduces the dictionary size, that is, the number of

distinct terms needed for representing a set of documents. A smaller dictionary size

results in saving of storage space and processing time.

For IR purposes, it doesn't usually matter whether the stems generated are

genuine words or not – thus, "computation" might be stemmed to "comput" –

provided that (a) different words with the same 'base meaning' are conflated to the

same form, and (b) words with distinct meanings are kept separate. An algorithm

which attempts to convert a word to its linguistically correct root ("compute" in this

case) is sometimes called a lemmatizer.

Examples of products using stemming algorithms would be search engines

such as Lycos and Google, and also thesauri and other products using NLP for the

purpose of IR. Stemmers and lemmatizers also have applications more widely within

the field of Computational Linguistics.



Over the last five decades, many stemming algorithms have been proposed and implemented to improve the information retrieval task. The first stemming algorithm was written by [53]. It is single pass, context sensitive, and uses longest suffix stripping. The Lovins stemmer maintains 250 suffixes, which are used in the stemming of strange (unseen) words. This stemmer was remarkable for its early date and had great influence on later work in this area. Later, Martin Porter designed and implemented a stemming algorithm for the English language at the University of Cambridge [52]. It is a five-step procedure, with rules applied under each step. If a word matches a suffix rule, then the condition attached to that rule is executed to get the stem. The Porter stemmer is a linear process. It is widely used as it is readily available. [121] of the Literary and Linguistic Computing Centre at Cambridge University developed a stemmer based on the Lovins stemmer, extending the suffix list to 1200 and modifying the recoding rules. [54] developed another stemming algorithm at the University of Massachusetts in 1993; it is based on morphology and uses a dictionary.

The techniques used in these stemmers were rule based. Preparing such rules

for a new language is time consuming and these techniques are not applicable to

morphologically rich languages like Telugu.

Later, unsupervised approaches came into the picture. [4] proposed an unsupervised morphology technique based on Minimum Description Length (MDL), built on a Bayesian model for the English and French languages. [56] uses the MDL framework to learn unsupervised morphological segmentation of European languages. [57] presented a language-independent and unsupervised algorithm for word segmentation based on the prior distribution of morpheme length and frequency. [117] proposed a stemming algorithm based on automatic clustering of words using co-occurrence information.

Previous work on the morphology of Indian languages includes a lightweight stemmer for Hindi [58], a statistical stemmer for Hindi, an unsupervised stemmer for Hindi [59], an unsupervised morphological analyser for Bengali, a hybrid stemmer for Gujarati [60] and a rule-based Telugu morphological generator (TelMore) [61]. The hybrid stemmer is not completely unsupervised: it uses a handcrafted suffix list. TelMore generates the morphological forms of verbs and nouns using the rules of [122,123].

The word segmentation heuristic used here is based on Goldsmith's approach [56]; it follows the split-all method, which uses a Boltzmann distribution. After applying the normalization heuristic, the performance of the proposed stemmer increases. The proposed paradigm presented in this chapter focuses on the development of an unsupervised Telugu stemmer for an effective Telugu information retrieval task.

A probabilistic model for stemmer generation [62] takes a step forward from graph-based stemming by introducing a probabilistic framework which models the mutual reinforcement between stems and derivations. The idea is to consider stemming as the inverse of a machine which generates words by concatenating prefixes and suffixes. It is then shown how the estimation of the probabilities of the model relates to the notion of mutual reinforcement and to the discovery of communities of stems and derivations.



An unsupervised Hindi stemmer with heuristic improvements [59] outperforms both the lightweight stemmer [58] and the UMass stemmer, and it does not require any linguistic input. Hence, it can be easily adapted to other languages. The approach used in this work is unsupervised, as it does not require inflection-root pairs for training. This makes it easily applicable to a new language. It is language independent because it does not require any language-specific rules as input. The approach does not require any domain-specific knowledge; hence it is domain independent as well. The algorithm is computationally inexpensive. The number of suffixes in the final list is 51, and longest suffix stripping is used to perform stemming. Some post-processing heuristic repairs were performed to further refine the learned suffixes. The training data was constructed by extracting 106403 words from the EMILLE corpus. The observed accuracy was 89.9% after applying some heuristic measures, and the F-score was 94.96%. The approach is partly in line with that of [56] and is based on the split-all method. For unsupervised learning (training), words from Hindi documents in the EMILLE corpus were extracted and split to give n-gram (n = 1, 2, 3, ..., l) suffixes, where l is the length of the word. Suffix and stem probabilities are then computed and multiplied to give a split probability; the optimal segment corresponds to the maximum split probability.

The Telugu language is very rich in literature, and it requires advances in computational approaches. Applications like machine translation, speech recognition, speech synthesis and information retrieval need a powerful morphological generator to give the morphological forms of nouns and verbs. The existing Telugu morphological analyzer (TMA) is rule based. Its performance is further improved by the novel approach of [63], which provides a system that gives information about the possible decompositions of a word inflected by many morphemes. Using these possible decompositions, the root word can be extracted for words which were unrecognized by the rule-based morphological analyzer. The experiment was conducted on a Telugu text corpus from CIIL Mysore, and the improvement in performance was checked against the rule-based morphological analyzer developed by the LTRC group, IIIT and HCU, Hyderabad. The observed performance of the rule-based analyzer increases from 77% to 84.2% on a test set numbering in the hundreds of words. It can still be improved if the corpus is increased.

Unsupervised Learning of the Morphology of a Natural Language [56] reports

the results of using minimum description length (MDL) analysis to model

unsupervised learning of the morphological segmentation of European languages,

using corpora ranging in size from 5,000 words to 500,000 words. They developed a

set of heuristics that rapidly develop a probabilistic morphological grammar, and use

MDL as the primary tool to determine whether the modifications proposed by the

heuristics will be adopted or not. The resulting grammar matches well the analysis

that would be developed by a human morphologist. In the final section, they discuss

the relationship of this style of MDL grammatical analysis to the notion of evaluation

metric in early generative grammar.

Statistical Stemming and Backoff Translation [64] describes a cross-language information retrieval architecture based on balanced document translation. A four-stage backoff strategy for improving the coverage of dictionary-based translation techniques is then introduced, and an implementation based on automatically trained statistical stemming is presented. Results indicate that competitive performance can be achieved using four-stage backoff translation in conjunction with freely available bilingual dictionaries, but the usefulness of the statistical stemming algorithms varied considerably across the three languages to which they were applied.

A Cross-Lingual Information Retrieval (CLIR) system was built as part of the Indian language sub-task of the main ad-hoc monolingual and bilingual track. The task required retrieval of relevant documents from an English corpus in response to a query expressed in different Indian languages, including Hindi, Tamil, Telugu, Bengali and Marathi. Groups participating in this track were required to submit an English to English monolingual run and a Hindi to English bilingual run, with optional runs in the rest of the languages. Their submission consisted of a monolingual English run and a Hindi to English cross-lingual run. A Cross-Lingual Information Retrieval System for Indian Languages [65] used a word alignment table, learned by a Statistical Machine Translation (SMT) system trained on aligned parallel sentences, to map a query in the source language into an equivalent query in the language of the document collection. The relevant documents are then retrieved using a language-modeling-based retrieval algorithm. On the CLEF 2007 data set, the official cross-lingual performance was 54.4% of the monolingual performance; in post-submission experiments, they found that it could be significantly improved, up to 76.3%.

To study further the relative merit of various search engines when exploring Hungarian and Bulgarian documents, and to evaluate these solutions, various effective IR models were used. Experiments on stemming approaches for East European languages [66] generally show that, for the Bulgarian language, removing certain frequently used derivational suffixes may improve mean average precision. For the Hungarian corpus, applying an automatic decompounding procedure improves the MAP. For the Czech language, a comparison of a light stemmer with a more aggressive stemmer that removes both inflectional and some derivational suffixes reveals only small performance differences. For this language, the performance difference between a word-based and a 4-gram indexing strategy is also rather small.

Instead of using a completely unsupervised approach, a lightweight stemmer

for Gujarati using a hybrid approach [60] harnessed the linguistic knowledge in the

form of a handcrafted Gujarati suffix list in order to improve the quality of the stems

and suffixes learnt during the training phase. They used the EMILLE corpus for

training and evaluating the stemmer’s performance. The use of handcrafted suffixes

boosted the accuracy of this stemmer by about 17% and helped to achieve an accuracy of 67.86%.

YASS: Yet Another Suffix Stripper [68] describes a clustering-based approach

to discover equivalence classes of root words and their morphological variants. A set

of string distance measures are defined, and the lexicon for a given text collection is

clustered using the distance measures to identify these equivalence classes. This

approach is compared with Porter's and Lovins' stemmers on the AP and the WSJ subcollections of the Tipster datasets using 200 queries. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total

number of relevant documents retrieved. This stemming algorithm also provides

consistent improvements in retrieval performance for French and Bengali, which are

currently resource-poor.

4.3 PROPOSED PARADIGM

The proposed paradigm is moderately in line with Goldsmith's approach [56]. It is based on taking all splits of each and every word from the corpus. Goldsmith's heuristic does not require any linguistic knowledge; thus it is completely unsupervised. For unsupervised learning, words from Telugu corpora are considered. To refine the learned suffixes, a normalization heuristic is applied. Each word from the corpus is considered in l-1 different ways, splitting the word into stem+suffix after i letters, where 1 <= i <= l-1 and l is the length of the word. Then the frequencies of the stems and suffixes of each split are computed. These frequencies are used to find an optimal split of the word; the optimal split identifies the stem and suffix. Forming analogous signatures and removing spurious signatures are then applied to get regular suffixes. Figure 4.2 shows the layout of the proposed paradigm. The details of these steps are described below:

4.3.1 WORD SEGMENTATION

Telugu corpus is given as input to this process. The corpus consists of unique

words. 129066 words are considered as corpus. Take-all-splits heuristic is used for

word segmentation and it uses all cuts of a word of length l into stem+suffix w1,i +

wi+1,l, where w refers to the word and 1≤ i < l. This heuristic assigns a value to each

split of the word w of length l: w1, i +wi+1, l. This heuristic takes all splits of a word wl

into stems and suffixes as:

{ }
133

where stem1i refers to prefix of word wl with length i and suffixi+1l refers to the

suffix of the same word with length l-i. For example, the word ‘అధికారము’ is split

into following stems and suffixes:

అ ధికారము అధ ికారము అధి కారము


{ }
అధిక ిారము అధికా రము అధికార ము అధికారమ ి

The proposed paradigm proceeds through the following stages:

1. Trained Telugu corpus
2. Word segmentation with the take-all-splits heuristic
3. Procuring the optimal split to produce generic stems and suffixes
4. Generating analogous signatures and removing irregular signatures
5. Normalization to enhance the proposed paradigm
6. Removing spurious signatures to increase the number of effective suffixes
7. List of powerful suffixes
8. Stemming of strange words with the longest-suffix-matching algorithm

Figure 4.2: Proposed paradigm



Then, the stems and suffixes are stored in the database, and the heuristic value of each split is calculated using the following equation:

Heuristic value at split i = i · log freq(stem = w_{1,i}) + (l - i) · log freq(suffix = w_{i+1,l})

As the value of i changes, the heuristic value also changes. The heuristic value depends mainly upon the frequencies of the stem and the suffix in the corpus, and these frequencies vary as the lengths of the stem and the suffix increase. The frequencies of shorter stems and suffixes are very high compared to those of slightly longer stems and suffixes. Thus the multipliers i (length of the stem) and l-i (length of the suffix) are introduced in this heuristic to compensate for this disparity.
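A minimal sketch of this step follows (function and variable names are illustrative assumptions; raw corpus counts serve as the frequencies in the equation above):

```python
# Minimal sketch of the take-all-splits heuristic (illustrative names).
import math
from collections import Counter

def collect_frequencies(words):
    """Count how often each stem and suffix occurs over all splits of all words."""
    stem_freq, suffix_freq = Counter(), Counter()
    for w in words:
        for i in range(1, len(w)):       # split after i letters, 1 <= i <= l-1
            stem_freq[w[:i]] += 1
            suffix_freq[w[i:]] += 1
    return stem_freq, suffix_freq

def heuristic_values(word, stem_freq, suffix_freq):
    """Value of split i: i * log freq(stem) + (l - i) * log freq(suffix)."""
    l = len(word)
    return {i: i * math.log(stem_freq[word[:i]])
               + (l - i) * math.log(suffix_freq[word[i:]])
            for i in range(1, l)}
```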

4.3.2 PROCURING OPTIMAL SPLIT

The output of the first heuristic is a set of heuristic values for each word. The split with the highest heuristic value is treated as the optimal split of the word. This segmentation depends mainly upon the frequency distribution of the stems and suffixes of the word, the language and the corpus. Some restrictions are applied while taking the optimal split. First, the word length is restricted to a minimum of three; under this condition, very short words are not split. The second restriction is based on the heuristic value of each split: if any heuristic value of a split is zero, that split is not taken into account. By this restriction, lengthy words with rare splits are not considered.
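Continuing the previous sketch, the optimal split under these two restrictions can be selected as follows (illustrative; `values` is the per-split dictionary returned by `heuristic_values` above):

```python
# Minimal sketch: optimal split under the two restrictions above
# (minimum word length 3; zero-valued splits discarded).
def optimal_split(word, values):
    if len(word) < 3:                  # restriction 1: too-short words are not split
        return None
    candidates = {i: v for i, v in values.items() if v > 0}
    if not candidates:                 # restriction 2: all splits zero-valued
        return None
    i = max(candidates, key=candidates.get)
    return word[:i], word[i:]          # generic stem, generic suffix

# e.g. optimal_split("readable", {4: 35.2, 5: 28.9, 6: 12.0}) -> ('read', 'able')
```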



The characterization of the heuristic values of all splits of a word is shown in Figure 4.3.

Figure 4.3: Heuristic values of all splits of the word

The index value is plotted on the x-axis and the heuristic values on the y-axis. As the graph shows, the heuristic values of the word first increase and then decrease. The point where the heuristic value starts decreasing is considered the optimal split of the word. As the stem length increases beyond a certain limit, the frequency of the stem starts decreasing; that is why the heuristic value starts decreasing. This step gives the generic stems and generic suffixes of the given corpus.

4.3.3 ANALOGOUS SIGNATURES

The above step assigns to every word of the corpus an optimal split into a generic stem and a generic suffix. After getting the generic stems and generic suffixes, the suffixes that have the same stem are grouped together. This is called the stem's signature. The structure is shown below:

{ stem1 } { suffix1, suffix2, suffix3, ... }

where suffix1, suffix2, suffix3, ... share the same stem, stem1. As a Telugu example, the suffixes క ుండా, వచ్ ు and తారు have the same stem అుంతరుంచిపో. Grouping the suffixes with this stem gives:

{ అుంతరుంచిపో } { క ుండా, వచ్ ు, తారు }

A signature can also refer to a set of suffixes together with the associated set of stems: the stems having the same set of suffixes are grouped. Here stem1 and stem2 have the same set of suffixes suffix1, suffix2, suffix3 and suffix4. This signature structure is shown below:

{ stem1, stem2 } { suffix1, suffix2, suffix3, suffix4 }

Here, at least two members are present in each set, all combinations present in this structure occur in the corpus, and each stem is found with no other suffix, though a suffix may well appear in other signatures.

Consider the Telugu words అుంతస్ు, అుంతరక్షనౌక and అకాడమీ. These three words share the same set of suffixes లో, ల and కి. Grouping the set of stems that have the same set of suffixes gives:

{ అుంతస్ు, అుంతరక్షనౌక, అకాడమీ } { లో, ల, కి }

All signatures that are associated with only one stem and only one suffix are discarded; these are called irregular signatures. The suffixes generated by the remaining signatures can be used directly in the stemming of strange words.
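Both groupings (suffixes per stem, and stems per shared suffix set), together with the removal of irregular signatures, can be sketched as follows (illustrative code; `splits` is assumed to be the list of (generic stem, generic suffix) pairs from the previous step):

```python
# Minimal sketch: building signatures from (stem, suffix) pairs and
# discarding irregular signatures (a single stem with a single suffix).
from collections import defaultdict

def build_signatures(splits):
    by_stem = defaultdict(set)
    for stem, suffix in splits:
        by_stem[stem].add(suffix)            # suffixes sharing the same stem
    by_suffix_set = defaultdict(set)
    for stem, suffixes in by_stem.items():
        by_suffix_set[frozenset(suffixes)].add(stem)  # stems sharing a suffix set
    # Keep only regular signatures: more than one stem or more than one suffix.
    return {sufs: stems for sufs, stems in by_suffix_set.items()
            if len(sufs) > 1 or len(stems) > 1}
```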

4.3.4 NORMALIZATION

It is observed that, in some of the analogous signatures, the suffixes sharing a stem have common prefixes. Thus, to enhance the performance of the proposed paradigm, another heuristic is applied. This heuristic normalizes the prefixes of suffixes that are common to the same stem, as shown by the following expressions:

suffix_i = a1 b1 suffix_1

suffix_j = a1 b1 suffix_2

suffix_k = a1 b1 suffix_3

suffix_l = c1 d1 e1 suffix_4

suffix_m = c1 d1 e1 suffix_5

suffix_i, suffix_j, suffix_k, suffix_l and suffix_m all have the same stem, stem_i. Applying this heuristic concatenates stem_i with the suffix prefixes a1b1 and c1d1e1 and modifies the stem and suffixes accordingly. After applying the normalization heuristic, the stems are stem_i·a1b1 and stem_i·c1d1e1, and the suffixes are suffix_1, suffix_2, suffix_3, suffix_4 and suffix_5.

The normalization process for a Telugu example is displayed below:

{ అుంతర } { ిుంగ, ిుంగుం, ిుంగుంలో, ిుంగాన, ిుంగక, తవమని, తవమున, ము, మున, మునక, ముపై, ముయొకక }

After applying the normalization heuristic to the stem ‘అుంతర’, the obtained stems and suffixes are displayed below:

{ అుంతరుంగ } { ిుం, ిుంలో, ిాన, ిక }
{ అుంతరతవమ } { ని, ిన }
{ అుంతరము } { నక, పై, యొకక }

The suffixes obtained after the normalization process are refined suffixes. This heuristic is applied to control over-stemming. These refined suffixes can be used in the stemming of strange words.
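A minimal sketch of the normalization heuristic is given below. It is illustrative only: suffixes of a stem are grouped by their leading character, and the longest common prefix of each group (playing the role of a1b1 above) is moved onto the stem; the example strings are hypothetical romanizations, not corpus data:

```python
# Minimal sketch: move the common prefix of a stem's suffixes onto the stem.
import os
from collections import defaultdict

def normalize(stem, suffixes, min_group=2):
    groups = defaultdict(list)
    for s in suffixes:
        groups[s[:1]].append(s)               # group suffixes by leading character
    result = {}                               # normalized stem -> refined suffixes
    for group in groups.values():
        if len(group) >= min_group:
            p = os.path.commonprefix(group)   # shared prefix, e.g. a1b1
            result.setdefault(stem + p, []).extend(s[len(p):] for s in group)
        else:
            result.setdefault(stem, []).extend(group)
    return result

print(normalize("anta", ["ram", "ramlo", "rana", "mupai", "munaku"]))
# {'antara': ['m', 'mlo', 'na'], 'antamu': ['pai', 'naku']}
```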

4.3.5 REMOVING SPURIOUS SIGNATURES

The output of the above step is a set of signatures. Again, all signatures that are associated with only one stem and only one suffix are discarded; these are called spurious signatures. To enrich the performance of the proposed paradigm, these signatures are removed. The remaining signatures are called refined signatures, and the suffixes drawn from them are called refined suffixes. The refined suffixes are not quite the suffixes needed to establish the morphology, but they are a very good approximation and useful for the stemming process.

Some of the refined suffixes produced by the proposed paradigm are shown below:

ల ిాల ి కనాా కి ి తో దాకా ి న ి న న ుంచి నికి ిానిా

ిుంలో ిుంలోకి పు ము గానే లేద లేని గా దని ిామా ిాము

ిామో ిాయి యోస్ ిుండి దాకా ి న ి నక ి న తుల ిిుంచి

ిుంతల ిుంన ుండి పరుంగా ిుంమీద ిుంలో కుం ిాలన ి ల లకి

లకీ ి లక ి లకూ కైతే ియము ిుంటే ి క ని గానే టుం తన

ిాడు ిారు ని గార చెన చేత చేసన చేసే తుల మున



4.3.6 STEMMING OF STRANGE WORDS

The output of the unsupervised paradigm is a list of powerful suffixes. Two sets of suffixes are available: the first suffix list is obtained before the normalization heuristic, and the second suffix list is generated after applying the normalization process. Each suffix approximately captures some morphological variation. The longest-suffix-matching algorithm is used in the stemming of strange words: the suffixes are organized in decreasing order of their lengths, and the longest possible suffix of the strange word that matches some suffix in the list is dropped. Matching proceeds from the first suffix to the last; wherever a match is found, the strange word is segmented into a stem and a suffix. Figure 4.4 shows the sample output of the proposed paradigm.

Figure 4.4: Sample output of the proposed paradigm
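In code, the longest-suffix-matching step can be sketched as follows (illustrative; `suffixes` is one of the two learned suffix lists described above):

```python
# Minimal sketch: stemming a strange (unseen) word by longest suffix match.
def stem_word(word, suffixes):
    # Trying suffixes in decreasing order of length makes the first match
    # the longest one; the matched suffix is stripped from the word.
    for suf in sorted(suffixes, key=len, reverse=True):
        if suf and word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)], suf
    return word, ""   # no suffix matched: the word is its own stem
```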



4.4 EXPERIMENTS

Central to information retrieval is the idea of relevance. The goal of an IR

system is to retrieve all the relevant documents while retrieving as few non-relevant

documents as possible. Retrieval performance is normally evaluated by using the

interrelated measures of precision and recall. Precision is defined to be the ratio of the

number of relevant documents retrieved to the total number of documents retrieved.

Recall is defined to be the ratio of the number of relevant documents retrieved to the

total number of relevant documents.

The ideal IR system would achieve 100% recall and 100% precision for all

queries. However, in practice, precision and recall tend to vary inversely. A very specific query formulation tends to produce high precision, but it also generally results in low recall. Conversely, a broader query formulation is likely to retrieve a

greater pool of documents, but the portion relevant is normally smaller, resulting in

lower precision. Most attempts at improving one variable tend to have a negative

effect on the other.

Calculation of recall requires knowledge of the total number of relevant

documents in the collection. For a small document collection, this is possible. But for

larger collections, determining the number of relevant documents may not be

practical. Recall figures for larger collections are often calculated by estimating the

number of relevant documents. Using sampling techniques is one method for doing

this. Another method is to perform a series of searches using various retrieval

techniques. The top results of these searches are combined to produce the relevant

documents. This technique, called pooling, is based on the assumption that the

combination of independent retrieval techniques will retrieve all or most relevant

documents.

Figure 4.5: Typical average precision vs. recall

Rather than determining relevance on a boolean, “yes or no”, answer based on

all the terms appearing in the document, a more flexible approach is to compute the

probability of relevance based on the number or frequency of query terms that appear

in the document. The more terms that occur, the greater the probability that the

document is relevant. A document that contains even one of the terms is viewed as a

potential answer, but documents that contain all or most of the terms will receive the

highest ranks. The ranking approach to information retrieval retrieves documents in decreasing order of likely relevance to the user's query.

The performance a user can expect from a system can be measured by averaging precision over a series of sample queries:

\bar{P} = \frac{1}{num} \sum_{i=1}^{num} P_i

where num is the number of queries and P_i is the precision observed for the i-th query.

A single measure that trades off precision versus recall is the F measure, the weighted harmonic mean of precision P and recall R:

F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1-\alpha}{\alpha}

where α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F measure equally weights precision and recall, which means setting α = 1/2 or β = 1. It is commonly written as F1, which is short for Fβ=1, even though the formulation in terms of α more transparently exhibits the F measure as a weighted harmonic mean. When using β = 1, the formula on the right simplifies to:

F_{\beta=1} = \frac{2 P R}{P + R}

However, using an even weighting is not the only choice. Values of β < 1 emphasize precision, while values of β > 1 emphasize recall. For example, a value of β = 3 or β = 5 might be used if recall is to be emphasized. Recall, precision, and the F measure are inherently measures between 0 and 1, but they are also very commonly written as percentages, on a scale between 0 and 100.
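In code, the weighted F measure is a direct transcription of the formula above (values shown are illustrative):

```python
# Minimal sketch: weighted F measure; beta = 1 gives the balanced F1.
def f_measure(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(round(f_measure(0.5, 0.8), 3))            # 0.615, balanced F1
print(round(f_measure(0.5, 0.8, beta=3.0), 3))  # 0.755, recall-emphasizing
```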

Mean Average Precision (MAP) provides a single-figure measure of quality across recall levels. Among the evaluation measures, MAP has been shown to have especially good discrimination and stability. For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved; this value is then averaged over information needs. That is, if the set of relevant documents for an information need q_j ∈ Q is {d_1, . . . , d_{m_j}} and R_{jk} is the set of ranked retrieval results from the top result until document d_k is reached, then

\text{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \text{Precision}(R_{jk})

When a relevant document is not retrieved at all, the precision value in the above equation is taken to be 0. For a single information need, the average precision approximates the area under the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the precision-recall curve for a set of queries.
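A minimal sketch of Average Precision and MAP under these definitions (names are illustrative; a relevant document that is never retrieved contributes a precision of 0, as stated above):

```python
# Minimal sketch: Mean Average Precision over a set of queries.
def average_precision(ranked, relevant):
    """`ranked`: retrieved doc ids in rank order; `relevant`: set of relevant ids."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)   # precision at each relevant document
    # Dividing by |relevant| scores never-retrieved relevant documents as 0.
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs`: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```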

There is an ongoing argument about the effectiveness of stemming; [124] reported that different outcomes and results are possible for a stemming process. [125] reported that the influence of stemming on overall performance is small if effectiveness is measured by traditional precision and recall measures. This is true if the language has simple morphology, like English. But [54] proved that the stemming process significantly improves retrieval effectiveness, mainly precision, for short queries or for languages with highly complex morphology, like the Romance languages [73].

The performance of a stemmer is mainly measured in terms of its contribution to enriching the performance of an IR system. In order to evaluate the proposed unsupervised Telugu stemmer, two different experiments were conducted. The first experiment evaluates the performance of the stemmer in terms of accuracy, precision and recall. In the second experiment, the performance of the stemmer is measured in terms of its capability to reduce the size of the index in an information retrieval task. For these two experiments, two sets of suffixes are used, and the performance of the proposed stemmer is evaluated and compared before and after the normalization heuristic. The training data is extracted from the CIIL Mysore corpus and contains 129066 words from 200 documents. Table 4.1 shows an overview of the characteristics of the trained corpus. Two different test runs are produced from words randomly extracted from Telugu daily newspapers.

Table 4.1: Trained Telugu corpus

Description Count

Total documents 200

Unique words 129066

Stem’s signatures before normalization 29501

Unique suffixes before normalization 22541

Regular suffixes before normalization 2746

Stem’s signatures after normalization 42273

Unique suffixes after normalization 14834

Refined suffixes after normalization 1583

Figure 4.6 compares the number of suffixes obtained before and after the normalization heuristic.

Figure 4.6: Number of suffixes (with and without normalization)



Experiment 1: To measure the performance of the stemmer, the accuracy, recall, precision and F-score metrics are considered in the first experiment. Two different test runs, run-1 and run-2, are performed. The evaluation metrics used in this experiment are described below:

Accuracy of the proposed stemmer is defined as the proportion of words stemmed correctly. It is calculated by comparing manually stemmed test words with the stems of the test data produced by the proposed stemmer.

Recall is defined as the ratio between the length of the stem generated by the proposed stemmer and the length of the actual stem identified manually. If the length of the stem produced by the proposed stemmer is less than the length of the actual stem, then the recall is treated as 100%, because the stem produced by the proposed stemmer is part of the actual stem. However, such words decrease the precision value, because the actual stem has some extra characters. Mathematically, recall is calculated using the formula below (s_p denotes the stem produced by the proposed paradigm and s_a the actual stem):

\text{Recall} = \begin{cases} 1 & \text{if } |s_p| < |s_a| \\ |s_a| / |s_p| & \text{otherwise} \end{cases}

Precision is defined as the ratio between the length of the actual stem identified manually and the length of the stem generated by the proposed stemmer. If the length of the stem produced by the proposed stemmer is greater than the length of the actual stem, then precision is treated as 100%, because all the characters present in the actual stem are also present in the stem produced by the proposed stemmer. Mathematically, precision is calculated using the formula below:

\text{Precision} = \begin{cases} 1 & \text{if } |s_p| > |s_a| \\ |s_p| / |s_a| & \text{otherwise} \end{cases}

where the per-word values are averaged over n, the total number of test words.

The harmonic mean of recall and precision is the F-score. Mathematically, the F-score is given by the formula below:

F\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
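A sketch of these length-based metrics, under the reading of the definitions above (illustrative; `pairs` holds (proposed stem, actual stem) tuples for the n test words):

```python
# Minimal sketch of the length-based evaluation used in Experiment 1.
def evaluate(pairs):
    n = len(pairs)
    recall = sum(1.0 if len(p) < len(a) else len(a) / len(p)
                 for p, a in pairs) / n
    precision = sum(1.0 if len(p) > len(a) else len(p) / len(a)
                    for p, a in pairs) / n
    accuracy = sum(p == a for p, a in pairs) / n
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score
```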

The two test runs of the proposed stemmer before and after normalization are

shown in Table 4.2 and Table 4.3 respectively.

Table 4.2: Test results before normalization

Test Data Accuracy Precision Recall F-score

Run-1 34.80 82.90 96.92 89.37

Run-2 50.20 88.52 95.11 91.70



Table 4.3: Test results after normalization

Test Data Accuracy Precision Recall F-score

Run-1 69.89 93.94 91.97 92.94

Run-2 85.40 96.91 89.14 92.86

Figure 4.7 shows the comparison graph of accuracy, precision, recall and F-score with and without the normalization heuristic.

Figure 4.7: Performance comparison with & without normalization

Experiment 2: This experiment measures the performance of the proposed paradigm in terms of the reduction in index size for Telugu information retrieval. For this experiment, documents from the Telugu corpus are taken. The difference between the number of index terms with and without stemming is considered the reduction in index size. Table 4.4 compares the percentage reduction in index size of the proposed stemmer before and after applying the normalization heuristic.



Table 4.4: Percentage reduction in index size

Test parameter   | No. of words without stemming | Before normalization heuristic | After normalization heuristic
No. of words     | Reference                     | 77.14%                         | 67.24%
Index size (MB)  | Reference                     | 83.68%                         | 76.05%

The comparison of index size in terms of the number of words, with and without the normalization heuristic, is shown in Figure 4.8.

Figure 4.8: Index size comparison (reference, without normalization, with normalization)

4.5 RESULTS AND DISCUSSION

Table 4.2 shows that, before normalization, the maximum accuracy is 50.20% and the maximum F-score is 91.70%, both in run-2. Table 4.3 shows that, after applying the normalization heuristic, the maximum accuracy is 85.40% (run-2) and the maximum F-score is 92.94% (run-1). Across all test runs, in terms of accuracy, precision and F-score, the proposed stemmer performs better after the normalization heuristic. Recall obtained before applying the normalization heuristic is higher than that obtained after applying it. The recall and precision values in the two test runs indicate that over-stemming is the main cause of the difference in accuracy. This improvement is achieved with a large set of refined, powerful suffixes.

The reduction in index size is shown in Table 4.4. The index size without stemming is taken as the baseline. In terms of the number of index terms, the reduction is 77.14% before applying normalization and 67.24% after. In terms of index size (measured in MB), the reductions are 83.68% and 76.05% before and after applying normalization, respectively.

4.6 SUMMARY

The results show that the proposed paradigm performs better after applying the normalization heuristic. The proposed stemmer is unsupervised and language independent: a corpus is used to derive the set of powerful suffixes, and no linguistic knowledge is needed. Hence, this approach can be used for developing stemmers for other morphologically rich languages.

The performance of the proposed paradigm is evaluated in terms of accuracy, precision, recall and F-score, compared before and after applying the normalization heuristic. The ability of the proposed stemmer to reduce the index size in an information retrieval task is also measured before and after applying normalization. The results show that the percentage reduction in index size is better for the proposed stemmer before normalization.

The most widely accepted measure for evaluating a stemmer within an IR system is Mean Average Precision (MAP) over the relevant retrieved documents as a fraction of the retrieved documents. The effectiveness of the proposed stemmer has not been evaluated in terms of MAP due to the unavailability of an IR system and other resources for the Telugu language; this is left as future work.
