TELSTEM: AN UNSUPERVISED TELUGU STEMMER WITH HEURISTIC IMPROVEMENTS AND NORMALIZED SIGNATURES
Morphological derivation produces variants of the same idea to evoke an action (verb), an object or concept (noun) or the property of something (adjective). For instance, the following words are derived from the same stem and share an abstract meaning of action and movement:

Stemming deduces the stem from a fully suffixed word according to its morphological rules. These rules concern derivational and inflectional suffixes. The former type usually changes the lexical category of words, whereas the latter indicates plural and gender (in gender-marking languages such as French, Spanish and German):
Stemming reduces the different forms of the same word to one common "stem". Stemming can mean both prefix and suffix removal. Stemming can, for example, be used to ensure that the greatest number of relevant matches is included in search results. A word's stem is its most basic form: for example, the stem of a plural noun is the singular; the stem of a past-tense verb is the present tense. The stem is, however, not to be confused with the word's lemma: the stem does not have to be an actual word itself. Instead, the stem can be said to be the common part shared by the variants of the word.
The practical differences between the two, beyond the part-of-speech tagging of the text, are mainly a question of cost. It is considerably more expensive, in terms of time and effort, to develop a well-performing lemmatizer than to develop a stemmer, and it also costs more in run time to use a lemmatizer than to use a stemmer. The reason for this is that the stemmer can use ad-hoc suffix and prefix stripping rules and exception lists, while the lemmatizer needs a full vocabulary and morphological analysis to lemmatize words correctly.
The two words stemming and lemmatization differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly, removing inflectional endings only and returning the base or dictionary form of a word, which is known as the lemma. If confronted with the token "saw", stemming might return just "s", whereas lemmatization would attempt to return either "see" or "saw" depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.
The most common algorithm for stemming English, and the one that has repeatedly been shown to be empirically very effective, is Porter's algorithm. The entire algorithm is too long and intricate to present here, but we will indicate its general nature. Porter's algorithm consists of five phases of word reductions, applied sequentially. Within each phase, there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix.
Many of the later rules use a concept of the measure of a word, which loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of the word.
Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words. Particular domains may also require special stemming rules. However, the exact stemmed form does not matter, only the equivalence classes it forms.
Rather than using a stemmer, you can use a lemmatizer, a tool from Natural Language Processing that does full morphological analysis to accurately identify the lemma for each word. Doing the full morphological analysis produces at most very modest benefits for retrieval. It is hard to say more, because either form of normalization tends not to improve retrieval performance in aggregate - at least not by very much. While it helps a lot for some queries, it equally hurts performance a lot on others. Stemming increases recall while harming precision.
As an example of what can go wrong, note that the Porter stemmer stems all of the following words to oper: operate, operating, operates, operation, operative, operatives, operational. Since operate in its various forms is a common verb, we would expect to lose considerable precision on queries such as the following with Porter stemming: "operational and research", "operating and system", "operative and dentistry".
For a case like this, moving to a lemmatizer would not completely fix the problem, because particular inflectional forms are used in particular collocations: a sentence containing some other form of the word "operate" is not a good match for the query "operating and system". Getting better value from term normalization depends more on pragmatic issues of word use than on formal issues of linguistic morphology.
Stemming allows reducing the index size and enhancing recall, sometimes even without significant loss in precision, and users do not need to worry about the "proper" morphological form of words in a query. A related problem is whether a word should be truncated only at the right root morpheme: a stemmer's goal is not to find a proper meaningful root of a word. Instead, a word can be truncated at a position "incorrect" from the natural language point of view. Nevertheless, since the document index and queries are stemmed "invisibly" for the user, this particularity should not be considered a flaw, but rather a feature.

Stemming algorithms are commonly divided into affix removal, statistical and mixed approaches. Affix removal stemmers apply a set of transformation rules to each word, trying to cut off known prefixes or suffixes. The first such algorithm was described by [53]. Rule-based affix removal requires a priori knowledge of language morphology, and statistical algorithms try to cope with this limitation by learning affixes from corpora. Although measures of under-stemming (removing too little of the suffix) and over-stemming (removing too much) do exist, they are hard to use due to the lack of a standard testing set. So usually, stemmers are compared indirectly by their effect on retrieval performance.
[Figure: taxonomy of conflation methods — manual methods versus automatic methods (stemmers); automatic methods include affix removal (longest match, simple removal), successor variety, table lookup and n-gram approaches.]
Store a table of all index terms and their stems, so that terms from queries and indexes can be stemmed via table lookup.

PROBLEMS

There is no such complete data for English, and some terms are domain dependent. There is also the storage overhead for such a table, though trading size for time is sometimes warranted.
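As a rough illustration (a hypothetical sketch, not part of the proposed stemmer), table lookup reduces conflation to a dictionary access:

# Hypothetical table-lookup conflation: every known term maps to a stored stem.
stem_table = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term, table=stem_table):
    # Unknown terms fall through unchanged.
    return table.get(term.lower(), term)

print(lookup_stem("Engineering"))   # engineer
print(lookup_stem("telugu"))        # telugu (not in the table)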
Successor varieties for the word READABLE in a sample corpus:

Prefix      Successor variety   Letters
R           3                   E, I, O
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   (blank)
Segment boundaries can then be determined using one of the following: the cutoff method, the complete word method, or the entropy method.
In the entropy method, |Di| is the number of words in a text body beginning with the i-length sequence of letters, and |Dij| is the number of those words whose next letter is the j-th letter of the alphabet. The entropy of the successors of the i-length prefix is

Hi = - Σ (j = 1 to 26) (|Dij| / |Di|) log2 (|Dij| / |Di|)
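The following sketch, assuming a small invented word list rather than any real corpus, shows how successor varieties and successor entropies can be computed:

import math
from collections import Counter

corpus = ["readable", "reading", "reads", "red", "rope", "ripe"]

def successors(prefix, words):
    # Letters that follow `prefix` in the word list; "" marks the end of a word.
    return Counter(w[len(prefix)] if len(w) > len(prefix) else ""
                   for w in words if w.startswith(prefix))

def successor_variety(prefix, words):
    return len(successors(prefix, words))

def successor_entropy(prefix, words):
    counts = successors(prefix, words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

for i in range(1, len("readable") + 1):
    prefix = "readable"[:i]
    print(prefix, successor_variety(prefix, corpus),
          round(successor_entropy(prefix, corpus), 2))

A peak in the successor variety (or entropy) at a prefix such as READ suggests a segment boundary there.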
2. The number of correct segment cuts divided by the total number of the
true boundaries
After segmenting, if the first segment occurs in more than 12 words in the
corpus, it is probably a prefix.
N-GRAM STEMMERS
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
S = 2C / (A + B) = (2 × 6) / (7 + 8) = 0.80

where A and B are the numbers of unique digrams in the first and the second words, and C is the number of unique digrams shared by the two words.
Similarity measures are determined for all pairs of terms in the database, forming a similarity matrix. Once such a similarity matrix is available, the terms are clustered using a single-link clustering method.
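A minimal sketch of this digram-based similarity (with an invented term list, and omitting the single-link clustering step):

def digrams(word):
    # Unique adjacent letter pairs of the word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(w1, w2):
    # S = 2C / (A + B): A, B are unique digram counts, C the shared digram count.
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

terms = ["statistics", "statistical", "stemming"]
matrix = [[dice_similarity(t1, t2) for t2 in terms] for t1 in terms]
print(round(dice_similarity("statistics", "statistical"), 2))   # 0.8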
Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. Conditions may be attached to the removal rules.
Stemmers operating on natural words inevitably make mistakes. On the one hand, words which ought to be merged together (such as "adhere" and "adhesion") may remain distinct after stemming; on the other, words which are really distinct may be conflated to the same stem. By counting these errors for a sample of words, we can gain an insight into the operation of a stemmer. To enable the errors to be counted, the words in the sample must already be divided into conceptual groups (such as "adhere", "adhering", "adhesion", "adhesive") which ought all to be merged to the same stem.
If the two words belong to the same conceptual group, and are converted to
the same stem, then the conflation is correct; if however they are converted to
different stems, this is counted as an understemming error.
If the two words belong to different conceptual groups, and remain distinct
after stemming, then the stemmer has behaved correctly. If however they are
converted to the same stem, this is counted as an overstemming error.
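The error counting described above can be sketched as follows (the conceptual groups and stems here are toy data, not from any evaluation in this chapter):

from itertools import combinations

groups = {"adhere": "A", "adhering": "A", "adhesion": "A",
          "standard": "B", "stand": "C"}
stems = {"adhere": "adher", "adhering": "adher", "adhesion": "adhes",
         "standard": "stand", "stand": "stand"}

under = over = 0
for w1, w2 in combinations(groups, 2):
    same_group = groups[w1] == groups[w2]
    same_stem = stems[w1] == stems[w2]
    if same_group and not same_stem:
        under += 1          # should have been merged but were not
    elif not same_group and same_stem:
        over += 1           # should have stayed distinct but were merged

print(under, over)          # 2 understemming errors, 1 overstemming error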
Users do not need to include several forms of the same word in a query, such as "sale OR sales". Stemming handles these variations, so the queries "sale" and "sales" will return the same results.

Reduces index size. Indexed terms are stored in the lexicon. A stemmed index means that the linguistic roots are stored in the lexicon rather than the whole terms. Each entry in the lexicon contains references to the documents in which it appears. When many word forms are matched by a single query, the desired documents may be returned along with a large number of other related documents. If a query returns too many documents, it can be narrowed down by adding more specific terms.
Exceptions to stemming rules. Although stemming has been used since the
early days of computer science, algorithms that were developed for that purpose still
suffer from the same limitation, that is, the conflation of words that are not
semantically related, such as "university" and "universal". This means that a query
containing the term "universal" may retrieve a document that contains "university".
Handling such exceptions specially is rarely worthwhile in terms of memory and speed, given their relative rareness. It is then required to
understand the stemming mechanisms to recognize and bypass these situations when
they happen, because Coveo Enterprise Search's "Optimized for English" and
Impact on advanced syntax and exact match. Advanced syntax and exact
match queries work the same, but the terms in such queries are stemmed. In the case
of advanced syntax, it can lead to unexpected results, if different terms in the query
share the same stem. For instance, "consumers NOT consuming" is stemmed to
"consum- NOT consum-" and leads to no result. Exact match queries can also
occasionally retrieve unexpected results because Coveo Enterprise Search looks for
exact matches of stems. For instance, the exact match string "sales reports" matches
"sal- report-".
such as French, Spanish and German, although this very rarely leads to confusion
between stems.
Short words are not stemmed. Words of four or fewer letters are not stemmed. Short words tend to be more prone to stemming errors, especially in English, where care must be taken to recognize short and long syllables. Since this situation concerns only a small number of terms, short words are left unstemmed to avoid such errors.
For this reason, a number of so-called stemming algorithms, or stemmers, have been
developed, which attempt to reduce a word to its stem or root form. Thus, the key
terms of a query or document are represented by stems rather than by the original
words. This not only means that different variants of a term can be conflated to a
single representative form – it also reduces the dictionary size, that is, the number of
distinct terms needed for representing a set of documents. A smaller dictionary size also reduces the storage required for the index.
For IR purposes, it does not usually matter whether the stems generated are genuine words or not, provided that (a) different words with the same 'base meaning' are conflated to the same form, and (b) words with distinct meanings are kept separate. An algorithm which attempts to convert a word to its linguistically correct root ("compute" in this case) is usually called a lemmatizer. Stemmers are used in many retrieval systems and Web search engines such as Lycos and Google, and also in thesauri and other products using NLP for the purpose of IR. Stemmers and lemmatizers also have applications more widely within natural language processing.
Over the last five decades, many stemming algorithms have been proposed and implemented to improve the information retrieval task. The first stemming algorithm was written by [53]. It is a single-pass, context-sensitive, longest-suffix-stripping algorithm. The Lovins stemmer maintains 250 suffixes, which are used when stemming unseen words. This stemmer was remarkable for its early date and had great influence on later work in this area. Later, Martin Porter designed and implemented his stemmer at the University of Cambridge [52]. It is a five-step procedure, with a set of rules applied at each step. If a word matches a suffix rule, then the condition attached to that rule is checked to obtain the stem. The Porter stemmer is a linear process, and it is widely used as it is freely available and effective. Cambridge University developed a stemmer that is based on the Lovins stemmer by extending the suffix list to 1200 and modifying the recoding rules. [54] developed another stemmer.
The techniques used in these stemmers were rule based. Preparing such rules for a new language is time consuming, and these techniques are not directly applicable to other languages. A later, statistical approach is based on a Bayesian model for the English and French languages. [56] uses the split-all method. Related work includes a lightweight stemmer for Hindi [58], a statistical stemmer for Hindi, an unsupervised stemmer for Hindi [59], an unsupervised morphological analyser for Bengali, a hybrid stemmer for Gujarati [60] and a rule-based Telugu morphological generator (TelMore) for Telugu [61]. The hybrid stemmer is not completely unsupervised: it uses a handcrafted suffix list. TelMore generates the morphological forms of verbs and nouns by using rules.
The split-all heuristic is based on taking all splits of a word and uses a Boltzmann distribution. After applying heuristic improvements, the present work develops an unsupervised Telugu stemmer for the effective Telugu information retrieval task.
One related approach exploits the mutual reinforcement between stems and derivations. The idea is to consider words as combinations of prefixes and suffixes. It then shows how the estimation of the probabilities of the model relates to the notion of mutual reinforcement and to the discovery of stems and suffixes. It outperforms both the lightweight stemmer [58] and the UMass stemmer, and it does not require any linguistic input; hence, it can be easily adapted to other languages. The approach used in this work is unsupervised as it does not require inflection-root pairs for training, and it is language independent because it does not require any language-specific rules as input. The approach does not require any domain-specific knowledge; hence it is domain independent as well.
The algorithm is computationally inexpensive. The number of suffixes in the final list is further reduced: heuristic repairs have been performed to refine the learned suffixes. For this stemmer, the training data has been constructed by extracting 106,403 words from the EMILLE corpus. The observed accuracy was found to be 89.9% after applying some heuristic measures, and the F-score was 94.96%. The unsupervised Hindi stemmer with heuristic improvements is partly in line with the approach of [56]. It is based on words extracted from documents of the EMILLE corpus. These words have been split to give n-gram (n = 1, 2, 3, …, l) suffixes, where l is the length of the word. It then computes suffix and stem probabilities, which are multiplied to give a split probability. The optimal segment corresponds to the maximum split probability. Rules are applied to give morphological forms of nouns and verbs. The existing Telugu morphological
analyzer (TMA) is rule based. Its performance is further improved by the novel approach of [63], which provides a system that gives information about possible decompositions; the root word could be extracted for those words which were not handled earlier. The system is trained on a Telugu text corpus from CIIL Mysore, and the improvement in performance is checked against the rule-based morphological analyzer developed by the LTRC group at IIIT and HCU, Hyderabad. The observed increase in the performance of the rule-based analyzer is from 77% to 84.2%, measured on a test set of a few hundred words. It can still be improved if the corpus is increased.
One unsupervised morphology-learning study used corpora ranging in size from 5,000 words to 500,000 words. The authors developed a
set of heuristics that rapidly develop a probabilistic morphological grammar, and use
MDL as the primary tool to determine whether the modifications proposed by the
heuristics will be adopted or not. The resulting grammar matches well the analysis
that would be developed by a human morphologist. In the final section, they discuss
the relationship of this style of MDL grammatical analysis to the notion of the evaluation metric in early generative grammar.
A back-off strategy has been proposed for improving the coverage of dictionary-based translation; good coverage can be achieved using a four-stage back-off translation in conjunction with freely available resources. The results varied considerably across the three languages to which the algorithms were applied.
One effort addressed the Indian language sub-task of the main ad hoc monolingual and bilingual track; the task covered Indian languages including Bengali and Marathi. Groups participating in this track were required to submit an
English to English monolingual run and a Hindi to English bilingual run with optional
runs in the rest of the languages. Their submission consisted of a monolingual English
run and a Hindi to English cross-lingual run. A Cross-Lingual Information Retrieval System for Indian Languages [65] used a word alignment table that was learnt by a statistical machine translation system to map a query in the source language into an equivalent query in the language of the document collection.
document collection. The relevant documents are then retrieved using a Language
Modeling based retrieval algorithm. On the CLEF 2007 data set, this official cross-
lingual performance was 54.4% of the monolingual performance and in the post
Different indexing and search strategies have been evaluated when exploring Hungarian and Bulgarian documents. Evaluations of these approaches for East European languages [66] generally show that for the Bulgarian language, removing certain frequently used derivational suffixes may improve mean average precision (MAP), and for the Hungarian language an automatic decompounding procedure improves the MAP. For the Czech language, a comparison of a light stemmer and a more aggressive stemmer that removes both inflectional and some derivational suffixes reveals only small performance differences. For this language only, performance
A stemmer for Gujarati using a hybrid approach [60] harnessed linguistic knowledge in the form of a handcrafted Gujarati suffix list in order to improve the quality of the stems and suffixes learnt during the training phase. The authors used the EMILLE corpus for training and evaluating the stemmer's performance. The use of handcrafted suffixes boosted the accuracy of this stemmer by about 17% and helped it achieve an accuracy of 67.86%.
Another approach discovers equivalence classes of root words and their morphological variants. A set
of string distance measures are defined, and the lexicon for a given text collection is
clustered using the distance measures to identify these equivalence classes. This
approach is compared with Porter's and Lovins' stemmers on the AP and the WSJ sub-collections of the TIPSTER dataset using 200 queries. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total
One corpus-based technique yields consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor.
The approach used here is based on taking all splits of each and every word from the corpus. Goldsmith's heuristic does not require any linguistic knowledge; thus it is completely unsupervised. For unsupervised learning, words from a Telugu corpus are considered. To refine the learned suffixes, a normalization heuristic is applied. Each word from the corpus is considered in l−1 different ways, splitting the word into stem+suffix after i letters, where 1 ≤ i ≤ l−1 and l is the length of the word. Then the frequencies of the stems and suffixes of each split are computed. These frequencies are used to get an optimal split of the word; the optimal split determines the stem and the suffix. Constructing analogous signatures and removing spurious signatures are then applied to get regular suffixes. Figure 4.2 shows the layout of the proposed paradigm. The details of these steps are described below.
A Telugu corpus consisting of 129,066 unique words is given as input to this process. The take-all-splits heuristic is used for word segmentation; it uses all cuts of a word of length l into stem + suffix, w1,i + wi+1,l, where 1 ≤ i ≤ l−1. The heuristic assigns a value to each split of the word w of length l, considering the set of all splits

{ w1,i + wi+1,l : 1 ≤ i ≤ l−1 }
where stem w1,i refers to the prefix of the word of length i and suffix wi+1,l refers to the suffix of the same word of length l−i. For example, the word 'అధికారము' is split
[Figure 4.2: Layout of the proposed paradigm — trained Telugu corpus → word segmentation with the take-all-splits heuristic → normalization to enhance the proposed paradigm → list of powerful suffixes.]
Then the stems and suffixes are stored in the database, and the heuristic value of each split is computed as

Heuristic value at split i = i · log freq(stem = w1,i) + (l − i) · log freq(suffix = wi+1,l)

As the value of i changes, the heuristic value also changes. The heuristic value mainly depends upon the frequencies of the stem and suffix in the corpus, and these frequencies vary with the lengths of the stem and the suffix: the frequencies of shorter stems and suffixes are very high compared with those of slightly longer stems and suffixes. Thus the multipliers i (the length of the stem) and l − i (the length of the suffix) are introduced in this heuristic in order to compensate for this disparity.
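A minimal sketch of this step is given below; the word list is an invented toy example and the add-one smoothing of zero frequencies is an assumption (the thesis instead discards splits whose heuristic value is zero):

import math
from collections import Counter

words = ["playing", "played", "player", "play", "reading", "reader"]

stem_freq, suffix_freq = Counter(), Counter()
for w in words:
    for i in range(1, len(w)):              # split after i letters, 1 <= i <= l-1
        stem_freq[w[:i]] += 1
        suffix_freq[w[i:]] += 1

def optimal_split(word):
    l = len(word)
    best, best_value = None, float("-inf")
    for i in range(1, l):
        stem, suffix = word[:i], word[i:]
        # i * log freq(stem) + (l - i) * log freq(suffix), with add-one smoothing
        value = (i * math.log(stem_freq[stem] + 1)
                 + (l - i) * math.log(suffix_freq[suffix] + 1))
        if value > best_value:
            best, best_value = (stem, suffix), value
    return best

print(optimal_split("playing"))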
The output of the first heuristic is a set of heuristic values for each word. The split with the highest heuristic value is treated as the optimal split of the word. This segmentation mainly depends upon the frequency distribution of the stems and suffixes of the word, the language and the corpus. Some restrictions are applied while taking the optimal split. First, the word length is restricted to a minimum of three, so very short words are not split. The second restriction is based on the heuristic value of each split: if the heuristic value of a split is zero, it is not taken into account. By this restriction, a lengthy word like
[Figure: heuristic values (y-axis) plotted against the split index (x-axis) for a sample word.]
The split index is plotted on the x-axis and the heuristic values on the y-axis. As the graph shows, the heuristic values of the word increase first and then decrease. The point where the heuristic value starts decreasing is considered the optimal split of the word. As the stem length increases beyond a certain limit, the frequency of the stem starts decreasing; that is the reason why the heuristic value starts decreasing. This step gives the generic stems and generic suffixes of the given corpus.

The above step assigns to every word of the corpus an optimal split into a generic stem and a generic suffix. After getting the generic stems and generic suffixes,
the suffixes that share the same stem are grouped together; this is called the stem's signature:

{ stem } { suffix1, suffix2, suffix3 }

Considering a Telugu example, the suffixes క ుండా, వచ్ ు and తారు share the same stem:

{ అుంతరుంచిపో } { క ుండా, వచ్ ు, తారు }
A signature can also be described as the set of suffixes together with the associated set of stems. The stems having the same set of suffixes are grouped. Here stem1 and stem2 have the same set of suffixes suffix1, suffix2, suffix3 and suffix4, giving the structure:

{ stem1, stem2 } { suffix1, suffix2, suffix3, suffix4 }
Here, at least two members are present in each set and all combinations
present in this structure are available in the corpus and each stem is found with no
other suffix, but the suffix may well appear in other signatures.
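A sketch of the signature construction (the stem+suffix splits are toy data): group the suffixes observed with each stem, then group the stems that take exactly the same suffix set:

from collections import defaultdict

splits = [("walk", "ing"), ("walk", "ed"), ("talk", "ing"), ("talk", "ed"),
          ("read", "er")]

stem_suffixes = defaultdict(set)
for stem, suffix in splits:
    stem_suffixes[stem].add(suffix)

signatures = defaultdict(set)
for stem, suffixes in stem_suffixes.items():
    signatures[frozenset(suffixes)].add(stem)

for suffix_set, stems in signatures.items():
    print(sorted(stems), sorted(suffix_set))
# ['talk', 'walk'] ['ed', 'ing']
# ['read'] ['er']   <- only one stem and one suffix: an irregular signature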
Consider the Telugu words అుంతస్ు , అుంతరక్షనౌక and అకాడమీ. These three words can take the same set of suffixes లో, ల and కి. Grouping the set of stems that are associated with this suffix set gives:

{ అుంతస్ు , అుంతరక్షనౌక, అకాడమీ } { లో, ల, కి }

All signatures that are associated with only one stem and only one suffix are discarded; these are called irregular signatures. The suffixes generated by this step form the first suffix list.
4.3.4 NORMALIZATION
Suffixes having the same stem often involve a common set of prefixes. Thus, to enhance the performance of the proposed paradigm, the normalization heuristic normalizes the prefixes of suffixes that are common to the same stem. This is shown below:
suffixi = a1b1suffix1
suffixj = a1b1suffix2
suffixk = a1b1suffix3
suffixl = c1d1e1suffix4
suffixm = c1d1e1suffix5
Suffixi, suffixj, suffixk, suffixl and suffixm have the same stem stemi. The application of this heuristic concatenates stemi with the prefixes of the suffixes, a1b1 and c1d1e1, and modifies the stem and the suffixes. The stems and suffixes after applying the normalization heuristic are stemia1b1, stemic1d1e1, suffix1, suffix2, suffix3, suffix4 and suffix5.
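The idea can be sketched as follows; the grouping rule (suffixes of one stem are grouped by their first letter, and the longest common prefix of each group is moved onto the stem) is an assumption made for illustration, and the example uses Latin script:

from collections import defaultdict
from os.path import commonprefix

def normalize(stem, suffixes):
    groups = defaultdict(list)
    for s in suffixes:
        groups[s[0]].append(s)

    new_stems, new_suffixes = set(), set()
    for group in groups.values():
        shared = commonprefix(group) if len(group) > 1 else ""
        if shared:
            new_stems.add(stem + shared)          # shared prefix moves to the stem
            new_suffixes.update(s[len(shared):] for s in group if len(s) > len(shared))
        else:
            new_stems.add(stem)
            new_suffixes.update(group)
    return new_stems, new_suffixes

print(normalize("anta", ["ring", "ringam", "ringaana", "mu", "munaku"]))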
A Telugu example, showing the stem and suffixes before normalization, is given below:
ిుంగ
ిుంగుం
ిుంగుంలో
ిుంగాన
ిుంగక
తవమని
{అుంతర} తవమున
ము
మున
మునక
ముపై
{ముయొకక}
After applying the normalization heuristic to the word 'అుంతర', the obtained stems and suffixes are:
ిుం
ిుంలో
ిాన
ిక
అుంతరుంగ ని
{అుంతరతవమ} ిన
అుంతరము
న
నక
పై
{యొకక}
The suffixes that are obtained after the normalization process are refined suffixes. This heuristic is applied to control over-stemming. These refined suffixes can be used to stem unseen words.

The output of the above step gives a set of signatures. Again, all signatures that are associated with only one stem and only one suffix are discarded; these signatures are called spurious signatures. To enrich the performance of the proposed paradigm, these signatures are removed. The remaining signatures are called refined signatures, and the suffixes derived from them are called refined suffixes. The refined suffixes are not quite the suffixes that can be used to establish the morphology of the language, but they are a very good approximation and useful for the stemming process.
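A small sketch of this spurious-signature filtering (the signature data is invented):

signatures = {
    frozenset({"ed", "ing"}): {"walk", "talk"},
    frozenset({"er"}): {"read"},     # one stem and one suffix -> spurious
}

refined = {suffixes: stems for suffixes, stems in signatures.items()
           if not (len(stems) == 1 and len(suffixes) == 1)}

refined_suffixes = sorted({s for suffixes in refined for s in suffixes})
print(refined_suffixes)              # ['ed', 'ing']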
After these steps, two sets of suffixes are available. The first suffix list is obtained before the normalization heuristic; the second suffix list is generated after applying the normalization process. Each list is used in turn to stem the test words. To stem unseen words, the suffixes are organized in decreasing order of their lengths, and the longest possible suffix of the unseen word that matches some suffix in the list is stripped. This is called the longest-suffix-matching algorithm. Matching starts from the first suffix and proceeds to the last; wherever a match is found, the unseen word is segmented into a stem and a suffix. Figure 4.4 shows the sample output of the proposed paradigm.
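The longest-suffix-match step can be sketched as follows; the suffix list and the minimum stem length of three characters are placeholders standing in for the learned list and the restriction mentioned earlier:

suffix_list = ["ing", "ed", "er", "s"]
suffix_list.sort(key=len, reverse=True)            # try longer suffixes first

def stem(word, suffixes=suffix_list, min_stem_len=3):
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem_len:
            return word[:-len(suf)], suf           # strip the longest matching suffix
    return word, ""                                # no suffix matched

print(stem("reading"))   # ('read', 'ing')
print(stem("sun"))       # ('sun', '')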
4.4 EXPERIMENTS
The goal of an information retrieval system is to retrieve all the relevant documents while retrieving as few non-relevant documents as possible. Retrieval effectiveness is usually assessed with the interrelated measures of precision and recall. Precision is defined to be the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Recall is defined to be the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection.

The ideal IR system would achieve 100% recall and 100% precision for all queries. However, in practice precision and recall tend to vary inversely. A very specific query formulation tends to produce high precision, but it also generally results in low recall. Conversely, a broader query formulation is likely to retrieve a greater pool of documents, but the portion that is relevant is normally smaller, resulting in lower precision. Most attempts at improving one variable tend to have a negative effect on the other.

Computing recall requires knowing the number of relevant documents in the collection. For a small document collection this is possible, but for large collections it is not practical. Recall figures for larger collections are often calculated by estimating the
number of relevant documents. Using sampling techniques is one method for doing this. Another method is to run a number of searches with different retrieval techniques; the top results of these searches are combined to produce the set of relevant documents. This technique, called pooling, is based on the assumption that the combined top results contain the vast majority of the relevant documents.
Rather than requiring that a document contain all the query terms, a more flexible approach is to compute the probability of relevance based on the number or frequency of query terms that appear in the document. The more of the terms that occur, the greater the probability that the document is relevant. A document that contains even one of the terms is viewed as a potential answer, but documents that contain all or most of the terms will receive the highest scores.
A single formula that measures the performance a user can expect from a system is therefore desirable. A single measure that trades off precision versus recall is the F-measure, the weighted harmonic mean of precision P and recall R:

F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α

where α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F-measure equally weights precision and recall, which means making α = 1/2 or β = 1. It is commonly written as F1, which is short for Fβ=1, even though the formulation in terms of α more transparently exhibits the F-measure as a weighted harmonic mean. However, using an even weighting is not the only choice. Values of β < 1 emphasize precision, while values of β > 1 emphasize recall; for example, a value of β = 3 or β = 5 might be used if recall is to be emphasized. Precision, recall and the F-measure are inherently measures between 0 and 1, but they are also commonly reported as percentages.

Mean Average Precision (MAP) provides a single-figure measure of quality across recall levels. Among the evaluation measures, MAP has been shown to have especially good discrimination and stability. For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents after each relevant document is retrieved, and this value is then averaged over information needs. That is, if the set of relevant documents for an information need qj ∈ Q is {d1, …, dmj} and Rjk is the set of ranked retrieval results from the top result until document dk is reached, then

MAP(Q) = (1 / |Q|) Σ(j = 1 to |Q|) (1 / mj) Σ(k = 1 to mj) Precision(Rjk)
When a relevant document is not retrieved at all, the precision value in the
above equation is taken to be 0. For a single information need, the average precision
approximates the area under the uninterpolated precision-recall curve, and so the
MAP is roughly the average area under the precision-recall curve for a set of queries.
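As a sketch of these definitions (the ranked lists and relevance judgements are invented toy data):

def average_precision(ranked_docs, relevant):
    hits, precisions = 0, []
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)        # precision after each relevant doc
    # Dividing by the number of relevant docs assigns precision 0 to any relevant
    # document that was never retrieved.
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: one (ranked_docs, relevant_set) pair per information need
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

run1 = (["d3", "d1", "d7", "d2"], {"d1", "d2"})
run2 = (["d5", "d4"], {"d4", "d9"})
print(mean_average_precision([run1, run2]))    # 0.375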
It has been observed that different outcomes and results are possible for a stemming process. [125] reported that stemming yields no significant improvement in effectiveness when measured by traditional precision and recall measures. This is true if the language has a simple morphology, like English. But [54] proved that the stemming process significantly improves retrieval effectiveness, mainly precision, for short queries or for morphologically rich languages.
To evaluate the proposed Telugu stemmer, two different experiments were conducted. The first experiment evaluates the quality of the stems produced, and the second evaluates the capability to reduce the size of the index in an information retrieval task. For these two experiments, two sets of suffixes are used, and the performance of the proposed stemmer is evaluated and compared before and after the normalization heuristic. The training data is extracted from the CIIL Mysore corpus and contains 129,066 words from 200 documents. Table 4.1 shows an overview of the characteristics of the trained corpus. Two different test runs are produced from words randomly extracted from Telugu daily newspapers.
[Table 4.1: Characteristics of the trained corpus (description and count for each parameter).]
Figure 4.6 compares the number of suffixes obtained before and after the normalization heuristic.
[Figure 4.6: Number of suffixes learned with normalization and without normalization.]
Accuracy, recall, precision and F-score metrics are considered in the first experiment. Two different test runs, run-1 and run-2, are performed. The evaluation metrics used in this experiment are defined as follows. Accuracy is the percentage of words stemmed correctly; it is calculated by comparing manually stemmed test words with the stems produced by the proposed stemmer. Recall is defined in terms of the length of the stem produced by the proposed stemmer and the length of the actual stem that is identified manually. If the length of the stem produced by the proposed stemmer is less than the length of the actual stem, then the recall is treated as 100%, because the stem produced by the proposed stemmer is part of the actual stem. However, these types of words decrease the precision value, because the actual stem has some extra characters. Mathematically, recall is calculated as follows:
Recall = 1, if the length of the stem produced by the proposed paradigm < length of the actual stem

Recall = (length of the actual stem) / (length of the stem produced by the proposed paradigm), otherwise

and the per-word values are averaged over all test words.
Precision compares the length of the actual stem identified manually with the length of the stem generated by the proposed stemmer. If the length of the stem produced by the proposed stemmer is greater than the length of the actual stem, then precision is treated as 100%, because all the characters present in the actual stem are also present in the stem produced by the proposed stemmer. Mathematically, precision is calculated as follows:
Precision = 1, if the length of the stem produced by the proposed paradigm > length of the actual stem

Precision = (length of the stem produced by the proposed paradigm) / (length of the actual stem), otherwise

with the per-word values averaged over all test words. The F-score is then computed from precision and recall as their harmonic mean.
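Since the exact formulas are only partly recoverable from the text, the following sketch fixes one reading of the rules above (the ratio orientation and the Romanised example stems are assumptions):

def word_recall(produced, actual):
    # 1 when the produced stem is shorter than the actual stem, else capped ratio
    if len(produced) < len(actual):
        return 1.0
    return len(actual) / len(produced)

def word_precision(produced, actual):
    # 1 when the produced stem is longer than the actual stem, else capped ratio
    if len(produced) > len(actual):
        return 1.0
    return len(produced) / len(actual)

def evaluate(pairs):
    # pairs: (stem produced by the stemmer, manually identified actual stem)
    recall = sum(word_recall(p, a) for p, a in pairs) / len(pairs)
    precision = sum(word_precision(p, a) for p, a in pairs) / len(pairs)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

print(evaluate([("chadu", "chaduvu"), ("chaduvutu", "chaduvu")]))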
The results of the two test runs of the proposed stemmer before and after normalization are reported in Tables 4.2 and 4.3. Figure 4.7 shows the comparison graph of accuracy, precision, recall and F-score.
[Figure 4.7: Comparison of accuracy, precision, recall and F-score (in percent).]
The capability of the proposed stemmer to reduce the index size is measured in the second experiment. For this experiment, a document collection from the Telugu corpus is taken. The difference between the number of index terms with and without stemming is computed, and the percentage reduction in index size of the proposed stemmer before normalization and after normalization is reported.
[Table: Number of index words for each test parameter — without stemming, before the normalization heuristic, and after the normalization heuristic.]
The comparison of index size in terms of the number of words with and without stemming is shown below.
[Figure: Index size (number of words) for the reference (unstemmed) index, without normalization, and with normalization.]
Table 4.2 shows that the maximum accuracy is found to be 50.20% and the F-score is found to be 91.70% in run-2. Table 4.3 shows the maximum accuracy obtained after applying the normalization heuristic in run-1. Looking at all test runs, the accuracy, precision and F-score obtained before applying the normalization heuristic are higher than those obtained after applying the normalization heuristic. The recall and precision values in the two test runs indicate that over-stemming is the main cause of the difference in accuracy.

The result of the reduction in the index size is shown in Table 4.4. In terms of the number of index terms, the reduction is found to be 77.14% before applying normalization and 67.24% after applying normalization. By considering the index size (measured in MB), the
4.6 SUMMARY
The results show that the proposed paradigm performs well after applying the heuristic improvements. The corpus is used to derive a set of powerful suffixes, and the method does not need any linguistic knowledge. Hence, this approach can be used for developing stemmers for other languages. The proposed stemmer is evaluated in terms of accuracy, precision, recall and F-score, and compared before and after applying the normalization heuristic. The ability to reduce the index size in an information retrieval task is also measured for the proposed stemmer before and after applying normalization. This shows that the percentage reduction in the index size is better for the proposed stemmer before normalization.
A standard measure for evaluating a stemmer through an IR system is Mean Average Precision (MAP) over all relevant retrieved documents. The proposed stemmer is not evaluated in terms of MAP due to the unavailability of an IR system and other resources for the Telugu language; this evaluation is left as future work.