Telstem: An Unsupervised Telugu Stemmer with Heuristic Improvements and Normalized Signatures


CHAPTER 4

TELSTEM: AN UNSUPERVISED TELUGU STEMMER WITH HEURISTIC IMPROVEMENTS AND NORMALIZED SIGNATURES

4.1 CONCEPT OF STEMMING

Linguistically, words follow the morphological rules that allow a speaker to

derive variants of a same idea to evoke an action (verb), an object or concept (noun)

or the property of something (adjective). For instance, the following words are

derived from the same stem and share an abstract meaning of action and movement:

activate activating active activeness

activated activation actively actives

activates activations activeness etc.

Stemming deduces the stem from a fully suffixed word according to its

morphological rules. These rules concern derivational and inflectional suffixes. The former usually changes the lexical category of a word, whereas the latter indicates plural and gender (in gender-oriented languages such as French, Spanish and German):

Derivational suffix: activate (verb) → activation (noun)

Inflectional suffix: activation (noun) → activations (plural noun)

Stemming is a technique to transform different inflections and derivations of

the same word to one common "stem". Stemming can mean both prefix and suffix

removal. Stemming can, for example, be used to ensure that the greatest number of

relevant matches is included in search results. A word's stem is its most basic form:

for example, the stem of a plural noun is the singular; the stem of a past-tense verb is

the present tense. The stem is, however, not to be confused with a word's lemma: the stem does not have to be an actual word itself. Instead, the stem can be said to be the least common denominator of the morphological variants.

4.1.1 COMPARISON OF STEMMING AND LEMMATIZATION

The motivation for using stemming instead of lemmatization, or indeed tagging of the text, is mainly a question of cost. It is considerably more expensive, in

terms of time and effort, to develop a well performing lemmatizer than to develop a

well performing stemmer. It is also more expensive in terms of computational power

and run time to use a lemmatizer than to use a stemmer. The reason for this is that the

stemmer can use ad-hoc suffix and prefix stripping rules and exception lists while the

lemmatizer must do a complete morphological analysis (based on actual grammatical rules and a dictionary). Another point of motivation is that a stemmer

can deliberately “bring together” semantically related words belonging to different

word classes to the same stem, which a lemmatizer cannot.

The goal of both stemming and lemmatization is to reduce inflectional forms

and sometimes derivationally related forms of a word to a common base form.

The two terms stemming and lemmatization differ in flavor. Stemming

usually refers to a crude heuristic process that chops off the ends of words in the hope

of achieving this goal correctly most of the time, and often includes the removal of

derivational affixes. Lemmatization usually refers to doing things properly with the

use of a vocabulary and morphological analysis of words, normally aiming to remove

inflectional endings only and to return the base or dictionary form of a word, which is

known as the lemma. If confronted with the token “saw”, stemming might return just

s, whereas lemmatization would attempt to return either see or saw depending on

whether the use of the token was as a verb or a noun. The two may also differ in that

stemming most commonly collapses derivationally related words, whereas

lemmatization commonly collapses only the different inflectional forms of a lemma.

Linguistic processing for stemming or lemmatization is often done by an additional

plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.

The most common algorithm for stemming English, and the one that has repeatedly been shown to be empirically very effective, is Porter's algorithm. The

entire algorithm is too long and intricate to present here, but we will indicate its

general nature. Porter's algorithm consists of 5 phases of word reductions, applied

sequentially. Within each phase, there are various conventions to select rules, such as

selecting the rule from each rule group that applies to the longest suffix.

Many of the later rules use a concept of the measure of a word, which loosely

checks the number of syllables to see whether a word is long enough that it is

reasonable to regard the matching portion of a rule as a suffix rather than as part of

the stem of a word.

Stemmers use language-specific rules, but they require less knowledge than a

lemmatizer, which needs a complete vocabulary and morphological analysis to

correctly lemmatize words. Particular domains may also require special stemming

rules. However, the exact stemmed form does not matter, only the equivalence classes

it forms.

Rather than using a stemmer, you can use a lemmatizer, a tool from Natural

Language Processing which does full morphological analysis to accurately identify

the lemma for each word. Doing the full morphological analysis produces at most

very modest benefits for retrieval. It is hard to say more, because either form of

normalization tends not to improve English information retrieval performance in

aggregate - at least not by very much. While it helps a lot for some queries, it equally hurts performance on others. Stemming increases recall while harming precision.

As an example of what can go wrong, note that the Porter stemmer stems all of the

following words:

operate operating operates operation operative operatives operational

to "oper". However, since "operate" is a common verb in its various forms, we

would expect to lose considerable precision on queries such as the following with

Porter stemming:

operational and research


operating and system
operative and dentistry

For a case like this, moving to a lemmatizer would not completely fix the problem, because particular inflectional forms are used in particular collocations: a document containing some form of the word "operate" is not thereby a good match for the query "operating and system". Getting better value from term

normalization depends more on pragmatic issues of word use than on formal issues of

linguistic morphology.
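This conflation is easy to check directly (a minimal sketch using NLTK's PorterStemmer; it assumes the nltk package is installed):

```python
# Minimal sketch: reproducing the "operate" conflation with NLTK's Porter
# stemmer (assumes the nltk package is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["operate", "operating", "operates", "operation",
         "operative", "operatives", "operational"]
# All seven forms collapse to the single stem "oper".
print({w: stemmer.stem(w) for w in words})
```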

4.1.2 TYPES OF STEMMING

The ability of an Information Retrieval (IR) system to conflate words allows reducing the index size and enhancing recall, sometimes even without significant deterioration of precision. Conflation also conforms to the user's intuition, because users do not need to worry about the "proper" morphological form of words in a query. Automated conflation is implemented by so-called "stemming" algorithms.

One of the key questions about conflation algorithms is whether automated stemming can be as effective (for IR purposes) as manual processing. Another important and related question is whether a word should be truncated only at the correct root-morpheme boundary, or may be cut at a non-linguistically correct point.

So, although stemming often relies on knowledge of language morphology, its goal is not to find a proper meaningful root of a word. Instead, a word can be truncated at a position "incorrect" from the natural language point of view. For example, Porter's algorithm [73] can make the following productions:

probate -> probat


cease -> ceas

Clearly, the results are not morphologically correct forms of the words.

Nevertheless, since the document index and queries are stemmed "invisibly" for a

user, this particularity should not be considered as a flaw, but rather as a feature

distinguishing stemming from lemmatization.



All stemming algorithms can be roughly classified as affix removing,

statistical and mixed. Affix removal stemmers apply a set of transformation rules to each word, trying to cut off known prefixes or suffixes. The first such algorithm was described by [53].

The major drawback of the affix removal approach is its dependency on a priori knowledge of language morphology. Statistical algorithms try to cope with this problem by finding distributions of root elements in a corpus. Such algorithms started evolving only recently, as increases in computing power made feasible the heavy computations such approaches require.

Mixed algorithms can combine several approaches. For example, an affix

removal algorithm can be enhanced by dictionary lookups for irregular verbs or

exceptional plural/singular forms like "feet/foot".

The variety of stemming algorithms naturally brings up the question of how to compare them. Though explicit measures such as under-stemming (removing too little of a suffix) and over-stemming (removing too much) do exist, they are hard to use due to the lack of a standard testing set. So stemmers are usually compared indirectly, by their effect on search recall. Performance characteristics (speed and storage requirements) are used as well.



Figure 4.1: Types of stemming. Conflation methods are divided into manual and automatic (stemmers); automatic methods comprise affix removal (longest match, simple removal), successor variety, table lookup and n-gram approaches.

TYPE OF STEMMING ALGORITHMS


 Table lookup approach
 Successor Variety
 n-gram stemmers
 Affix Removal Stemmers

TABLE LOOKUP APPROACH

Store a table of all index terms and their stems, so that terms from queries and

indexes could be stemmed very fast.

PROBLEMS
 There is no such data covering all of English, and some terms are domain dependent.

 The storage overhead for such a table can be large, though trading space for time is sometimes warranted.

SUCCESSOR VARIETY APPROACH


 Determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.

 The successor variety of a string is the number of different characters that


follow it in words in some body of text.

 The successor variety of substrings of a term will decrease as more characters


are added until a segment boundary is reached.

Test Word: READABLE

Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,

READING, READS, RED, ROPE, RIPE

Prefix Successor Variety Letters

R 3 E,I,O

RE 2 A,D

REA 1 D

READ 3 A,I,S

READA 1 B

READAB 1 L

READABL 1 E

READABLE 1 (Blank)

 cutoff method

 some cutoff value is selected and a boundary is identified whenever the


cutoff value is reached

 peak and plateau method

 segment break is made after a character whose successor variety


exceeds that of the characters immediately preceding and following it

 complete word method

 a segment break is made when a segment corresponds to a complete word in the corpus

 entropy method

 |D_i|: the number of words in a text body beginning with the length-i sequence of letters

 |D_{ij}|: the number of words in D_i with the successor j

 The probability that a member of D_i has the successor j is |D_{ij}| / |D_i|

 The entropy of D_i is

H_i = -\sum_{j=1}^{26} \frac{|D_{ij}|}{|D_i|} \log_2 \frac{|D_{ij}|}{|D_i|}

 Two criteria used to evaluate various segmentation methods

1. The number of correct segment cuts divided by the total number of


cuts

2. The number of correct segment cuts divided by the total number of the
true boundaries

 After segmenting, if the first segment occurs in more than 12 words in the
corpus, it is probably a prefix.

 The successor variety stemming process has three parts

1. Determine the successor varieties for a word

2. Segment the word using one of the methods

3. Select one of the segments as the stem
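The computation is easy to sketch. The following illustrative Python snippet (variable names and the end-of-word marker are assumptions, not from the original text) reproduces the successor variety counts of the READABLE example and applies the peak-and-plateau rule:

```python
# Minimal sketch: successor varieties for the READABLE example, plus the
# peak-and-plateau segmentation rule (toy corpus from the text above).
corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus):
    """Distinct characters following `prefix`; '' stands for the blank (end of word)."""
    return len({w[len(prefix):len(prefix) + 1] for w in corpus
                if w.startswith(prefix)})

word = "READABLE"
prefixes = [word[:i] for i in range(1, len(word) + 1)]
sv = [successor_variety(p, corpus) for p in prefixes]
print(list(zip(prefixes, sv)))
# [('R', 3), ('RE', 2), ('REA', 1), ('READ', 3), ('READA', 1),
#  ('READAB', 1), ('READABL', 1), ('READABLE', 1)]

# Peak-and-plateau: break after a prefix whose variety exceeds both neighbours.
breaks = [prefixes[i] for i in range(1, len(sv) - 1)
          if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
print(breaks)  # ['READ'] -> boundary after "READ", i.e. READ + ABLE
```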



N-GRAM STEMMERS

 Association measures are calculated between pairs of terms based on shared


unique digrams.

statistics => st ta at ti is st ti ic cs

unique digrams = at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al

unique digrams = al at ca ic is st ta ti

 Dice's coefficient (similarity)

S = 2C / (A + B) = (2 × 6) / (7 + 8) = 0.80

A and B are the numbers of unique digrams in the first and the second word; C is the number of unique digrams shared by the two. (A computational sketch follows this list.)

 Similarity measures are determined for all pairs of terms in the database,
forming a similarity matrix

 Once such a similarity matrix is available, the terms are clustered using a single-link clustering method.
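A minimal computational sketch of the digram extraction and Dice similarity above (helper names are illustrative):

```python
# Minimal sketch: Dice's coefficient over unique digrams, reproducing the
# statistics/statistical example (S = 2*6 / (7+8) = 0.80).
def digrams(term):
    """Set of unique character digrams of a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    A, B = digrams(a), digrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

print(sorted(digrams("statistics")))      # ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(dice("statistics", "statistical"))  # 0.8
```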

AFFIX REMOVAL STEMMERS

 Affix removal algorithms remove suffixes and/or prefixes from terms leaving
a stem

 If a word ends in “ies” but not “eies” or “aies”,

then “ies” -> “y”

 If a word ends in “es” but not “aes”, “ees” or “oes”,

then “es” -> “e”

 If a word ends in “s” but not “us” or “ss”,

then “s” -> “NULL”
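These three rules translate directly into code; the sketch below is an illustrative implementation of just this rule set, not a complete stemmer:

```python
# Minimal sketch: the three "ies"/"es"/"s" rules above, tried in order of
# decreasing suffix length.
def strip_plural(word):
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

print([strip_plural(w) for w in ["ponies", "wishes", "cats", "glass"]])
# ['pony', 'wishe', 'cat', 'glass']
```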

THE PORTER ALGORITHM

 The Porter algorithm consists of a set of condition/action rules.



 The conditions fall into three classes

 Conditions on the stem

 Conditions on the suffix

 Conditions on rules

4.1.3 ERRORS IN STEMMING

Natural languages are not completely regular constructs and therefore

stemmers operating on natural words inevitably make mistakes. On the one hand,

words which ought to be merged together (such as "adhere" and "adhesion") may

remain distinct after stemming; on the other, words which are really distinct may be

wrongly conflated (e.g., "experiment" and "experience"). These are known as

understemming errors and overstemming errors respectively. By counting these errors

for a sample of words, we can gain an insight into the operation of a stemmer, and

compare different stemmers.

To enable the errors to be counted, the words in the collection must already be

organized into 'conceptual groups', containing words (such as "adhere", "adheres",

"adhering", "adhesion", "adhesive") which ought all to be merged to the same stem.

 If the two words belong to the same conceptual group, and are converted to
the same stem, then the conflation is correct; if however they are converted to
different stems, this is counted as an understemming error.

 If the two words belong to different conceptual groups, and remain distinct
after stemming, then the stemmer has behaved correctly. If however they are
converted to the same stem, this is counted as an overstemming error.

4.1.4 BENEFITS OF STEMMING

Takes care of morphological variants. When an index is stemmed, users do

not need to include several forms of a same word in a query: "sale OR sales".

Stemming handles these variations such that the queries "sale" and "sales" will return

the same results.

Reduces index size. Indexed terms are stored in the lexicon. A stemmed index means that the linguistic roots are stored in the lexicon rather than the whole terms.

Each entry in the lexicon contains references to documents in which it appears. When

stemming is used, several terms are conflated as well as information relative to

document references, hence significantly reducing the size of the index.

4.1.5 LIMITATIONS OF STEMMING

It is important to keep in mind that the stemming limitations tend to disappear

with queries including more than one term.

Occasional increased recall. Since more documents are likely to be retrieved

by a single query, the desired documents may be returned with a large number of

other related documents. If a query returns too many documents, narrow it down by

adding more words to the query.

Exceptions to stemming rules. Although stemming has been used since the

early days of computer science, algorithms that were developed for that purpose still

suffer from the same limitation, that is, the conflation of words that are not

semantically related, such as "university" and "universal". This means that a query

containing the term "universal" may retrieve a document that contains "university".

Although exceptions to stemming rules are limited in number, it would be expensive in terms of memory and speed to handle them, given their relative rareness. It is therefore necessary to understand the stemming mechanisms in order to recognize and bypass these situations when they happen; Coveo Enterprise Search's "Optimized for English" and "Multilingual Stemming" both suffer from this limitation.

Impact on advanced syntax and exact match. Advanced syntax and exact

match queries work the same, but the terms in such queries are stemmed. In the case

of advanced syntax, it can lead to unexpected results, if different terms in the query

share the same stem. For instance, "consumers NOT consuming" is stemmed to

"consum- NOT consum-" and leads to no result. Exact match queries can also occasionally retrieve unexpected results, because Coveo Enterprise Search looks for exact matches of stems. For instance, the exact match string "sales reports" matches "sal- report-".

Accented characters are not supported. For performance purposes only,

Coveo Enterprise Search removes accents from characters before stemming.

Consequently, some accented characters in stems lose their distinction in languages

such as French, Spanish and German, although, this very rarely leads to confusion

between stems.

Short words are not stemmed. Words of four or fewer letters are not stemmed. Short words tend to be more error-prone in stemming, especially in English, where care must be taken to recognize short and long syllables. Since this situation affects only a small number of terms, short words are not stemmed, to

favor performance. Therefore, morphological variations of short words will not be

implicitly included in a query.

4.2 RELATED WORK

In most cases, morphological variants of words have similar semantic

interpretations and can be considered as equivalent for the purpose of IR applications.

For this reason, a number of so-called stemming algorithms, or stemmers, have been

developed, which attempt to reduce a word to its stem or root form. Thus, the key

terms of a query or document are represented by stems rather than by the original

words. This not only means that different variants of a term can be conflated to a

single representative form – it also reduces the dictionary size, that is, the number of

distinct terms needed for representing a set of documents. A smaller dictionary size

results in saving of storage space and processing time.

For IR purposes, it doesn't usually matter whether the stems generated are

genuine words or not – thus, "computation" might be stemmed to "comput" –

provided that (a) different words with the same 'base meaning' are conflated to the

same form, and (b) words with distinct meanings are kept separate. An algorithm

which attempts to convert a word to its linguistically correct root ("compute" in this

case) is sometimes called a lemmatizer.

Examples of products using stemming algorithms would be search engines

such as Lycos and Google, and also thesauri and other products using NLP for the

purpose of IR. Stemmers and lemmatizers also have applications more widely within

the field of Computational Linguistics.



Over the last five decades, many stemming algorithms have been proposed and implemented to improve the information retrieval task. The first stemming algorithm was written by [53]. It is single pass, context sensitive, and uses longest suffix stripping. The Lovins stemmer maintains 250 suffixes, which are used in the stemming of strange (unseen) words. This stemmer was remarkable for its early date and had great influence on later work in this area. Later, Martin Porter designed and implemented a stemming algorithm for the English language at the University of Cambridge [52]. It is a five-step procedure, with rules applied under each step. If a word matches a suffix rule, then the condition attached to that rule is executed to get the stem. The Porter stemmer is a linear process. It is widely used as it is readily available. [121] of the Literary and Linguistic Computing Centre at Cambridge University developed a stemmer based on the Lovins stemmer, extending the suffix list to 1200 and modifying the recoding rules. [54] developed another stemming algorithm at the University of Massachusetts in 1993; it is based on morphology and uses a dictionary.

The techniques used in these stemmers were rule based. Preparing such rules

for a new language is time consuming and these techniques are not applicable to

morphologically rich languages like Telugu.

Later, unsupervised approaches came into the picture. [4] proposed an unsupervised morphology technique based on Minimum Description Length (MDL), built on a Bayesian model for the English and French languages. [56] uses the MDL framework to learn unsupervised morphological segmentation of European languages. [57] presented a language-independent and unsupervised algorithm for word segmentation based on the prior distribution of morpheme length and frequency. [117] proposed a stemming algorithm based on automatic clustering of words using co-occurrence information.

Previous work on the morphology of Indian languages includes a lightweight stemmer for Hindi [58], a statistical stemmer for Hindi, an unsupervised stemmer for Hindi [59], an unsupervised morphological analyser for Bengali, a hybrid stemmer for Gujarati [60] and a rule-based Telugu morphological generator (TelMore) [61]. The hybrid stemmer is not completely unsupervised: it uses a handcrafted suffix list. TelMore generates the morphological forms of verbs and nouns using the rules of [122,123].

The word segmentation heuristic used here is based on Goldsmith's approach [56]; it follows the split-all method, which uses a Boltzmann distribution. After applying the normalization heuristic, the performance of the proposed stemmer increases. The proposed paradigm presented in this chapter focuses on the development of an unsupervised Telugu stemmer for an effective Telugu information retrieval task.

A probabilistic model for stemmer generation [62] takes a step forward from graph-based stemming by introducing a probabilistic framework which models the mutual reinforcement between stems and derivations. The idea is to consider stemming as the inverse of a machine which generates words by concatenating prefixes and suffixes. It is then shown how the estimation of the probabilities of the model relates to the notion of mutual reinforcement and to the discovery of communities of stems and derivations.



An unsupervised Hindi stemmer with heuristic improvements [59] outperforms both the lightweight stemmer [58] and the UMass stemmer, and it does not require any linguistic input. Hence, it can be easily adapted to other languages. The approach used in this work is unsupervised, as it does not require inflection-root pairs for training. This makes it easily applicable to a new language. It is language independent because it does not require any language-specific rules as input. The approach does not require any domain-specific knowledge; hence it is domain independent as well. The algorithm is computationally inexpensive. The number of suffixes in the final list is 51, and longest suffix stripping is used to perform stemming. Some post-processing heuristic repairs were performed to further refine the learned suffixes. The training data was constructed by extracting 106403 words from the EMILLE corpus. The observed accuracy was 89.9% after applying some heuristic measures, and the F-score was 94.96%. The approach is partly in line with that of [56] and is based on the split-all method. For unsupervised learning (training), words from Hindi documents in the EMILLE corpus were extracted and split to give n-gram (n = 1, 2, 3, ..., l) suffixes, where l is the length of the word. Suffix and stem probabilities are then computed and multiplied to give a split probability; the optimal segment corresponds to the maximum split probability.

The Telugu language is very rich in literature, and it requires advances in computational approaches. Applications like machine translation, speech recognition, speech synthesis and information retrieval need a powerful morphological generator to give the morphological forms of nouns and verbs. The existing Telugu morphological analyzer (TMA) is rule based. Its performance is further improved by the novel approach of [63], which provides a system that gives information about the possible decompositions of a word inflected by many morphemes. Using these possible decompositions, the root word can be extracted for words which were unrecognized by the rule-based morphological analyzer. The experiment was conducted on a Telugu text corpus from CIIL Mysore, and the improvement in performance was checked against the rule-based morphological analyzer developed by the LTRC group, IIIT and HCU, Hyderabad. The observed performance of the rule-based analyzer increases from 77% to 84.2% on a test set numbering in the hundreds of words. It can still be improved if the corpus is increased.

Unsupervised Learning of the Morphology of a Natural Language [56] reports

the results of using minimum description length (MDL) analysis to model

unsupervised learning of the morphological segmentation of European languages,

using corpora ranging in size from 5,000 words to 500,000 words. They developed a

set of heuristics that rapidly develop a probabilistic morphological grammar, and use

MDL as the primary tool to determine whether the modifications proposed by the

heuristics will be adopted or not. The resulting grammar matches well the analysis

that would be developed by a human morphologist. In the final section, they discuss

the relationship of this style of MDL grammatical analysis to the notion of evaluation

metric in early generative grammar.

Statistical Stemming and Backoff Translation [64] describes a cross-language information retrieval architecture based on balanced document translation. A four-stage backoff strategy for improving the coverage of dictionary-based translation techniques is then introduced, and an implementation based on automatically trained statistical stemming is presented. Results indicate that competitive performance can be achieved using four-stage backoff translation in conjunction with freely available bilingual dictionaries, but the usefulness of the statistical stemming algorithms varied considerably across the three languages to which they were applied.

A Cross-Lingual Information Retrieval (CLIR) system was built as part of the Indian language sub-task of the main ad-hoc monolingual and bilingual track. The task required retrieval of relevant documents from an English corpus in response to a query expressed in different Indian languages, including Hindi, Tamil, Telugu, Bengali and Marathi. Groups participating in this track were required to submit an English to English monolingual run and a Hindi to English bilingual run, with optional runs in the rest of the languages. Their submission consisted of a monolingual English run and a Hindi to English cross-lingual run. A Cross-Lingual Information Retrieval System for Indian Languages [65] used a word alignment table, learned by a Statistical Machine Translation (SMT) system trained on aligned parallel sentences, to map a query in the source language into an equivalent query in the language of the document collection. The relevant documents are then retrieved using a language-modeling-based retrieval algorithm. On the CLEF 2007 data set, the official cross-lingual performance was 54.4% of the monolingual performance; in post-submission experiments, they found that it could be significantly improved, up to 76.3%.

To study further the relative merit of various search engines when exploring Hungarian and Bulgarian documents, and to evaluate these solutions, various effective IR models were used. Experiments on stemming approaches for East European languages [66] generally show that, for the Bulgarian language, removing certain frequently used derivational suffixes may improve mean average precision. For the Hungarian corpus, applying an automatic decompounding procedure improves the MAP. For the Czech language, a comparison of a light stemmer with a more aggressive stemmer that removes both inflectional and some derivational suffixes reveals only small performance differences. For this language, the performance difference between a word-based and a 4-gram indexing strategy is also rather small.

Instead of using a completely unsupervised approach, a lightweight stemmer

for Gujarati using a hybrid approach [60] harnessed the linguistic knowledge in the

form of a handcrafted Gujarati suffix list in order to improve the quality of the stems

and suffixes learnt during the training phase. They used the EMILLE corpus for

training and evaluating the stemmer’s performance. The use of handcrafted suffixes

boosted the accuracy of this stemmer by about 17% and helped to achieve an accuracy of 67.86%.

YASS: Yet Another Suffix Stripper [68] describes a clustering-based approach

to discover equivalence classes of root words and their morphological variants. A set

of string distance measures are defined, and the lexicon for a given text collection is

clustered using the distance measures to identify these equivalence classes. This

approach is compared with Porter's and Lovins' stemmers on the AP and the WSJ subcollections of the Tipster datasets using 200 queries. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total

number of relevant documents retrieved. This stemming algorithm also provides

consistent improvements in retrieval performance for French and Bengali, which are

currently resource-poor.

4.3 PROPOSED PARADIGM

The proposed paradigm is moderately in line with Goldsmith's approach [56]. It is based on taking all splits of each and every word from the corpus. Goldsmith's heuristic does not require any linguistic knowledge; thus it is completely unsupervised. For unsupervised learning, words from Telugu corpora are considered. To refine the learned suffixes, a normalization heuristic is applied. Each word from the corpus is considered in l-1 different ways, splitting the word into stem+suffix after i letters, where 1 <= i <= l-1 and l is the length of the word. Then the frequencies of the stems and suffixes of each split are computed. These frequencies are used to find an optimal split of the word; the optimal split identifies the stem and suffix. Forming analogous signatures and removing spurious signatures are then applied to get regular suffixes. Figure 4.2 shows the layout of the proposed paradigm. The details of these steps are described below:

4.3.1 WORD SEGMENTATION

Telugu corpus is given as input to this process. The corpus consists of unique

words. 129066 words are considered as corpus. Take-all-splits heuristic is used for

word segmentation and it uses all cuts of a word of length l into stem+suffix w1,i +

wi+1,l, where w refers to the word and 1≤ i < l. This heuristic assigns a value to each

split of the word w of length l: w1, i +wi+1, l. This heuristic takes all splits of a word wl

into stems and suffixes as:

{ }
133

where stem1i refers to prefix of word wl with length i and suffixi+1l refers to the

suffix of the same word with length l-i. For example, the word ‘అధికారము’ is split

into following stems and suffixes:

అ ధికారము అధ ికారము అధి కారము


{ }
అధిక ిారము అధికా రము అధికార ము అధికారమ ి

The proposed paradigm proceeds through the following stages:

1. Trained Telugu corpus
2. Word segmentation with the take-all-splits heuristic
3. Procuring the optimal split to produce generic stems and suffixes
4. Generating analogous signatures and removing irregular signatures
5. Normalization to enhance the proposed paradigm
6. Removing spurious signatures to increase the number of effective suffixes
7. List of powerful suffixes
8. Stemming of strange words with the longest-suffix-matching algorithm

Figure 4.2: Proposed paradigm



Then, the stems and suffixes are stored in the database, and the heuristic value of each split is calculated using the following equation:

Heuristic value at split i = i · log freq(stem = w_{1,i}) + (l - i) · log freq(suffix = w_{i+1,l})

As the value of i changes, the heuristic value also changes. The heuristic value depends mainly upon the frequencies of the stem and the suffix in the corpus, and these frequencies vary as the lengths of the stem and the suffix increase. The frequencies of shorter stems and suffixes are very high compared to those of slightly longer stems and suffixes. Thus the multipliers i (length of the stem) and l-i (length of the suffix) are introduced in this heuristic to compensate for this disparity.
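A minimal sketch of this step follows (function and variable names are illustrative assumptions; raw corpus counts serve as the frequencies in the equation above):

```python
# Minimal sketch of the take-all-splits heuristic (illustrative names).
import math
from collections import Counter

def collect_frequencies(words):
    """Count how often each stem and suffix occurs over all splits of all words."""
    stem_freq, suffix_freq = Counter(), Counter()
    for w in words:
        for i in range(1, len(w)):       # split after i letters, 1 <= i <= l-1
            stem_freq[w[:i]] += 1
            suffix_freq[w[i:]] += 1
    return stem_freq, suffix_freq

def heuristic_values(word, stem_freq, suffix_freq):
    """Value of split i: i * log freq(stem) + (l - i) * log freq(suffix)."""
    l = len(word)
    return {i: i * math.log(stem_freq[word[:i]])
               + (l - i) * math.log(suffix_freq[word[i:]])
            for i in range(1, l)}
```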

4.3.2 PROCURING OPTIMAL SPLIT

The output of the first heuristic is a set of heuristic values for each word. The split with the highest heuristic value is treated as the optimal split of the word. This segmentation depends mainly upon the frequency distribution of the stems and suffixes of the word, the language and the corpus. Some restrictions are applied while taking the optimal split. First, the word length is restricted to a minimum of three; under this condition, very short words are not split. The second restriction is based on the heuristic value of each split: if any heuristic value of a split is zero, that split is not taken into account. By this restriction, lengthy words with rare splits are not considered.
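Continuing the previous sketch, the optimal split under these two restrictions can be selected as follows (illustrative; `values` is the per-split dictionary returned by `heuristic_values` above):

```python
# Minimal sketch: optimal split under the two restrictions above
# (minimum word length 3; zero-valued splits discarded).
def optimal_split(word, values):
    if len(word) < 3:                  # restriction 1: too-short words are not split
        return None
    candidates = {i: v for i, v in values.items() if v > 0}
    if not candidates:                 # restriction 2: all splits zero-valued
        return None
    i = max(candidates, key=candidates.get)
    return word[:i], word[i:]          # generic stem, generic suffix

# e.g. optimal_split("readable", {4: 35.2, 5: 28.9, 6: 12.0}) -> ('read', 'able')
```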



The characterization of the heuristic values of all splits of a word is shown in Figure 4.3.

Figure 4.3: Heuristic values of all splits of the word

The index value is plotted on the x-axis and the heuristic values on the y-axis. As the graph shows, the heuristic values of the word first increase and then decrease. The point where the heuristic value starts decreasing is considered the optimal split of the word. As the stem length increases beyond a certain limit, the frequency of the stem starts decreasing; that is why the heuristic value starts decreasing. This step gives the generic stems and generic suffixes of the given corpus.

4.3.3 ANALOGOUS SIGNATURES

The above step assigns to every word of the corpus an optimal split into a generic stem and a generic suffix. After getting the generic stems and generic suffixes, the suffixes that have the same stem are grouped together. This is called the stem's signature. The structure is shown below:

{ stem1 } { suffix1, suffix2, suffix3, ... }

where suffix1, suffix2, suffix3, ... share the same stem, stem1. As a Telugu example, the suffixes క ుండా, వచ్ ు and తారు have the same stem అుంతరుంచిపో. Grouping the suffixes with this stem gives:

{ అుంతరుంచిపో } { క ుండా, వచ్ ు, తారు }

A signature can also refer to a set of suffixes together with the associated set of stems: the stems having the same set of suffixes are grouped. Here stem1 and stem2 have the same set of suffixes suffix1, suffix2, suffix3 and suffix4. This signature structure is shown below:

{ stem1, stem2 } { suffix1, suffix2, suffix3, suffix4 }

Here, at least two members are present in each set, all combinations present in this structure occur in the corpus, and each stem is found with no other suffix, though a suffix may well appear in other signatures.

Consider the Telugu words అుంతస్ు, అుంతరక్షనౌక and అకాడమీ. These three words share the same set of suffixes లో, ల and కి. Grouping the set of stems that have the same set of suffixes gives:

{ అుంతస్ు, అుంతరక్షనౌక, అకాడమీ } { లో, ల, కి }

All signatures that are associated with only one stem and only one suffix are discarded; these are called irregular signatures. The suffixes generated by the remaining signatures can be used directly in the stemming of strange words.
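Both groupings (suffixes per stem, and stems per shared suffix set), together with the removal of irregular signatures, can be sketched as follows (illustrative code; `splits` is assumed to be the list of (generic stem, generic suffix) pairs from the previous step):

```python
# Minimal sketch: building signatures from (stem, suffix) pairs and
# discarding irregular signatures (a single stem with a single suffix).
from collections import defaultdict

def build_signatures(splits):
    by_stem = defaultdict(set)
    for stem, suffix in splits:
        by_stem[stem].add(suffix)            # suffixes sharing the same stem
    by_suffix_set = defaultdict(set)
    for stem, suffixes in by_stem.items():
        by_suffix_set[frozenset(suffixes)].add(stem)  # stems sharing a suffix set
    # Keep only regular signatures: more than one stem or more than one suffix.
    return {sufs: stems for sufs, stems in by_suffix_set.items()
            if len(sufs) > 1 or len(stems) > 1}
```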

4.3.4 NORMALIZATION

It is observed that, in some of the analogous signatures, the suffixes sharing a stem have common prefixes. Thus, to enhance the performance of the proposed paradigm, another heuristic is applied. This heuristic normalizes the prefixes of suffixes that are common to the same stem, as shown by the following expressions:

suffix_i = a1 b1 suffix_1

suffix_j = a1 b1 suffix_2

suffix_k = a1 b1 suffix_3

suffix_l = c1 d1 e1 suffix_4

suffix_m = c1 d1 e1 suffix_5

suffix_i, suffix_j, suffix_k, suffix_l and suffix_m all have the same stem, stem_i. Applying this heuristic concatenates stem_i with the suffix prefixes a1b1 and c1d1e1 and modifies the stem and suffixes accordingly. After applying the normalization heuristic, the stems are stem_i·a1b1 and stem_i·c1d1e1, and the suffixes are suffix_1, suffix_2, suffix_3, suffix_4 and suffix_5.

The normalization process for a Telugu example is displayed below:

{ అుంతర } { ిుంగ, ిుంగుం, ిుంగుంలో, ిుంగాన, ిుంగక, తవమని, తవమున, ము, మున, మునక, ముపై, ముయొకక }

After applying the normalization heuristic to the stem ‘అుంతర’, the obtained stems and suffixes are displayed below:

{ అుంతరుంగ } { ిుం, ిుంలో, ిాన, ిక }
{ అుంతరతవమ } { ని, ిన }
{ అుంతరము } { నక, పై, యొకక }

The suffixes obtained after the normalization process are refined suffixes. This heuristic is applied to control over-stemming. These refined suffixes can be used in the stemming of strange words.
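A minimal sketch of the normalization heuristic is given below. It is illustrative only: suffixes of a stem are grouped by their leading character, and the longest common prefix of each group (playing the role of a1b1 above) is moved onto the stem; the example strings are hypothetical romanizations, not corpus data:

```python
# Minimal sketch: move the common prefix of a stem's suffixes onto the stem.
import os
from collections import defaultdict

def normalize(stem, suffixes, min_group=2):
    groups = defaultdict(list)
    for s in suffixes:
        groups[s[:1]].append(s)               # group suffixes by leading character
    result = {}                               # normalized stem -> refined suffixes
    for group in groups.values():
        if len(group) >= min_group:
            p = os.path.commonprefix(group)   # shared prefix, e.g. a1b1
            result.setdefault(stem + p, []).extend(s[len(p):] for s in group)
        else:
            result.setdefault(stem, []).extend(group)
    return result

print(normalize("anta", ["ram", "ramlo", "rana", "mupai", "munaku"]))
# {'antara': ['m', 'mlo', 'na'], 'antamu': ['pai', 'naku']}
```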

4.3.5 REMOVING SPURIOUS SIGNATURES

The output of the above step is a set of signatures. Again, all signatures that are associated with only one stem and only one suffix are discarded; these are called spurious signatures. To enrich the performance of the proposed paradigm, these signatures are removed. The remaining signatures are called refined signatures, and the suffixes drawn from them are called refined suffixes. The refined suffixes are not quite the suffixes needed to establish the morphology, but they are a very good approximation and useful for the stemming process.

Some of the refined suffixes produced by the proposed paradigm are shown below:

ల ిాల ి కనాా కి ి తో దాకా ి న ి న న ుంచి నికి ిానిా

ిుంలో ిుంలోకి పు ము గానే లేద లేని గా దని ిామా ిాము

ిామో ిాయి యోస్ ిుండి దాకా ి న ి నక ి న తుల ిిుంచి

ిుంతల ిుంన ుండి పరుంగా ిుంమీద ిుంలో కుం ిాలన ి ల లకి

లకీ ి లక ి లకూ కైతే ియము ిుంటే ి క ని గానే టుం తన

ిాడు ిారు ని గార చెన చేత చేసన చేసే తుల మున



4.3.6 STEMMING OF STRANGE WORDS

The output of the unsupervised paradigm is a list of powerful suffixes. Two sets of suffixes are available: the first suffix list is obtained before the normalization heuristic, and the second suffix list is generated after applying the normalization process. Each suffix approximately captures some morphological variation. The longest-suffix-matching algorithm is used in the stemming of strange words: the suffixes are organized in decreasing order of their lengths, and the longest possible suffix of the strange word that matches some suffix in the list is dropped. Matching proceeds from the first suffix to the last; wherever a match is found, the strange word is segmented into a stem and a suffix. Figure 4.4 shows the sample output of the proposed paradigm.

Figure 4.4: Sample output of the proposed paradigm
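In code, the longest-suffix-matching step can be sketched as follows (illustrative; `suffixes` is one of the two learned suffix lists described above):

```python
# Minimal sketch: stemming a strange (unseen) word by longest suffix match.
def stem_word(word, suffixes):
    # Trying suffixes in decreasing order of length makes the first match
    # the longest one; the matched suffix is stripped from the word.
    for suf in sorted(suffixes, key=len, reverse=True):
        if suf and word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)], suf
    return word, ""   # no suffix matched: the word is its own stem
```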



4.4 EXPERIMENTS

Central to information retrieval is the idea of relevance. The goal of an IR

system is to retrieve all the relevant documents while retrieving as few non-relevant

documents as possible. Retrieval performance is normally evaluated by using the

interrelated measures of precision and recall. Precision is defined to be the ratio of the

number of relevant documents retrieved to the total number of documents retrieved.

Recall is defined to be the ratio of the number of relevant documents retrieved to the

total number of relevant documents.

The ideal IR system would achieve 100% recall and 100% precision for all

queries. However, in practice, precision and recall tend to vary inversely. A very specific query formulation tends to produce high precision, but it also generally results in low recall. Conversely, a broader query formulation is likely to retrieve a

greater pool of documents, but the portion relevant is normally smaller, resulting in

lower precision. Most attempts at improving one variable tend to have a negative

effect on the other.

Calculation of recall requires knowledge of the total number of relevant

documents in the collection. For a small document collection, this is possible. But for

larger collections, determining the number of relevant documents may not be

practical. Recall figures for larger collections are often calculated by estimating the

number of relevant documents. Using sampling techniques is one method for doing

this. Another method is to perform a series of searches using various retrieval

techniques. The top results of these searches are combined to produce the relevant

documents. This technique, called pooling, is based on the assumption that the

combination of independent retrieval techniques will retrieve all or most relevant

documents.

Figure 4.5: Typical average precision vs. recall

Rather than determining relevance on a boolean, “yes or no”, answer based on

all the terms appearing in the document, a more flexible approach is to compute the

probability of relevance based on the number or frequency of query terms that appear

in the document. The more terms that occur, the greater the probability that the

document is relevant. A document that contains even one of the terms is viewed as a

potential answer, but documents that contain all or most of the terms will receive the

highest ranks. The ranking approach to information retrieval retrieves documents in decreasing order of likely relevance to the user's query.

The performance a user can expect from a system can be measured by averaging precision over a series of sample queries:

\bar{P} = \frac{1}{num} \sum_{i=1}^{num} P_i

where num is the number of queries and P_i is the precision observed for the i-th query.

A single measure that trades off precision versus recall is the F measure, the weighted harmonic mean of precision P and recall R:

F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1-\alpha}{\alpha}

where α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F measure equally weights precision and recall, which means setting α = 1/2 or β = 1. It is commonly written as F1, which is short for Fβ=1, even though the formulation in terms of α more transparently exhibits the F measure as a weighted harmonic mean. When using β = 1, the formula on the right simplifies to:

F_{\beta=1} = \frac{2 P R}{P + R}

However, using an even weighting is not the only choice. Values of β < 1 emphasize precision, while values of β > 1 emphasize recall. For example, a value of β = 3 or β = 5 might be used if recall is to be emphasized. Recall, precision, and the F measure are inherently measures between 0 and 1, but they are also very commonly written as percentages, on a scale between 0 and 100.
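In code, the weighted F measure is a direct transcription of the formula above (values shown are illustrative):

```python
# Minimal sketch: weighted F measure; beta = 1 gives the balanced F1.
def f_measure(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(round(f_measure(0.5, 0.8), 3))            # 0.615, balanced F1
print(round(f_measure(0.5, 0.8, beta=3.0), 3))  # 0.755, recall-emphasizing
```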

Mean Average Precision (MAP) provides a single-figure measure of quality across recall levels. Among the evaluation measures, MAP has been shown to have especially good discrimination and stability. For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved; this value is then averaged over information needs. That is, if the set of relevant documents for an information need q_j ∈ Q is {d_1, . . . , d_{m_j}} and R_{jk} is the set of ranked retrieval results from the top result until document d_k is reached, then

\text{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \text{Precision}(R_{jk})

When a relevant document is not retrieved at all, the precision value in the above equation is taken to be 0. For a single information need, the average precision approximates the area under the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the precision-recall curve for a set of queries.
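A minimal sketch of Average Precision and MAP under these definitions (names are illustrative; a relevant document that is never retrieved contributes a precision of 0, as stated above):

```python
# Minimal sketch: Mean Average Precision over a set of queries.
def average_precision(ranked, relevant):
    """`ranked`: retrieved doc ids in rank order; `relevant`: set of relevant ids."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)   # precision at each relevant document
    # Dividing by |relevant| scores never-retrieved relevant documents as 0.
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs`: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```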

There is an ongoing argument about the effectiveness of stemming; [124] reported that different outcomes and results are possible for a stemming process. [125] reported that the influence of stemming on overall performance is small if effectiveness is measured by traditional precision and recall measures. This is true if the language has simple morphology, like English. But [54] proved that the stemming process significantly improves retrieval effectiveness, mainly precision, for short queries or for languages with highly complex morphology, like the Romance languages [73].

The performance of a stemmer is mainly measured in terms of its contribution to enriching the performance of an IR system. In order to evaluate the proposed unsupervised Telugu stemmer, two different experiments were conducted. The first experiment evaluates the performance of the stemmer in terms of accuracy, precision and recall. In the second experiment, the performance of the stemmer is measured in terms of its capability to reduce the size of the index in an information retrieval task. For these two experiments, two sets of suffixes are used, and the performance of the proposed stemmer is evaluated and compared before and after the normalization heuristic. The training data is extracted from the CIIL Mysore corpus and contains 129066 words from 200 documents. Table 4.1 shows an overview of the characteristics of the trained corpus. Two different test runs are produced from words randomly extracted from Telugu daily newspapers.

Table 4.1: Trained Telugu corpus

Description Count

Total documents 200

Unique words 129066

Stem’s signatures before normalization 29501

Unique suffixes before normalization 22541

Regular suffixes before normalization 2746

Stem’s signatures after normalization 42273

Unique suffixes after normalization 14834

Refined suffixes after normalization 1583

Figure 4.6 compares the number of suffixes obtained before and after the normalization heuristic.

Figure 4.6: Number of suffixes (with and without normalization)



Experiment 1: To measure the performance of the stemmer, the accuracy, recall, precision and F-score metrics are considered in the first experiment. Two different test runs, run-1 and run-2, are performed. The evaluation metrics used in this experiment are described below:

Accuracy of the proposed stemmer is defined as the proportion of words stemmed correctly. It is calculated by comparing manually stemmed test words with the stems of the test data produced by the proposed stemmer.

Recall is defined as the ratio between the length of the stem generated by the proposed stemmer and the length of the actual stem identified manually. If the length of the stem produced by the proposed stemmer is less than the length of the actual stem, then the recall is treated as 100%, because the stem produced by the proposed stemmer is part of the actual stem. However, such words decrease the precision value, because the actual stem has some extra characters. Mathematically, recall is calculated using the formula below (s_p denotes the stem produced by the proposed paradigm and s_a the actual stem):

\text{Recall} = \begin{cases} 1 & \text{if } |s_p| < |s_a| \\ |s_a| / |s_p| & \text{otherwise} \end{cases}

Precision is defined as the ratio between the length of the actual stem identified manually and the length of the stem generated by the proposed stemmer. If the length of the stem produced by the proposed stemmer is greater than the length of the actual stem, then precision is treated as 100%, because all the characters present in the actual stem are also present in the stem produced by the proposed stemmer. Mathematically, precision is calculated using the formula below:

\text{Precision} = \begin{cases} 1 & \text{if } |s_p| > |s_a| \\ |s_p| / |s_a| & \text{otherwise} \end{cases}

where the per-word values are averaged over n, the total number of test words.

The harmonic mean of recall and precision is the F-score. Mathematically, the F-score is given by the formula below:

F\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
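A sketch of these length-based metrics, under the reading of the definitions above (illustrative; `pairs` holds (proposed stem, actual stem) tuples for the n test words):

```python
# Minimal sketch of the length-based evaluation used in Experiment 1.
def evaluate(pairs):
    n = len(pairs)
    recall = sum(1.0 if len(p) < len(a) else len(a) / len(p)
                 for p, a in pairs) / n
    precision = sum(1.0 if len(p) > len(a) else len(p) / len(a)
                    for p, a in pairs) / n
    accuracy = sum(p == a for p, a in pairs) / n
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score
```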

The two test runs of the proposed stemmer before and after normalization are

shown in Table 4.2 and Table 4.3 respectively.

Table 4.2: Test results before normalization

Test Data Accuracy Precision Recall F-score

Run-1 34.80 82.90 96.92 89.37

Run-2 50.20 88.52 95.11 91.70



Table 4.3: Test results after normalization

Test Data Accuracy Precision Recall F-score

Run-1 69.89 93.94 91.97 92.94

Run-2 85.40 96.91 89.14 92.86

Figure 4.7 shows the comparison graph of accuracy, precision, recall and F-score with and without the normalization heuristic.

Figure 4.7: Performance comparison with & without normalization

Experiment 2: This experiment measures the performance of the proposed paradigm in terms of the reduction in index size for Telugu information retrieval. For this experiment, documents from the Telugu corpus are taken. The difference between the number of index terms with and without stemming is considered the reduction in index size. Table 4.4 compares the percentage reduction in index size of the proposed stemmer before and after applying the normalization heuristic.



Table 4.4: Percentage reduction in index size

Test parameter   | No. of words without stemming | Before normalization heuristic | After normalization heuristic
No. of words     | Reference                     | 77.14%                         | 67.24%
Index size (MB)  | Reference                     | 83.68%                         | 76.05%

The comparison of index size in terms of the number of words, with and without the normalization heuristic, is shown in Figure 4.8.

Figure 4.8: Index size comparison (reference, without normalization, with normalization)

4.5 RESULTS AND DISCUSSION

Table 4.2 shows that, before normalization, the maximum accuracy is 50.20% and the maximum F-score is 91.70%, both in run-2. Table 4.3 shows that, after applying the normalization heuristic, the maximum accuracy is 85.40% (run-2) and the maximum F-score is 92.94% (run-1). Across all test runs, in terms of accuracy, precision and F-score, the proposed stemmer performs better after the normalization heuristic. Recall obtained before applying the normalization heuristic is higher than that obtained after applying it. The recall and precision values in the two test runs indicate that over-stemming is the main cause of the difference in accuracy. This improvement is achieved with a large set of refined, powerful suffixes.

The reduction in index size is shown in Table 4.4. The index size without stemming is taken as the baseline. In terms of the number of index terms, the reduction is 77.14% before applying normalization and 67.24% after. In terms of index size (measured in MB), the reductions are 83.68% and 76.05% before and after applying normalization, respectively.

4.6 SUMMARY

The results show that the proposed paradigm performs better after applying the normalization heuristic. The proposed stemmer is unsupervised and language independent: a corpus is used to derive the set of powerful suffixes, and no linguistic knowledge is needed. Hence, this approach can be used for developing stemmers for other morphologically rich languages.

The performance of the proposed paradigm is evaluated in terms of accuracy, precision, recall and F-score, compared before and after applying the normalization heuristic. The ability of the proposed stemmer to reduce the index size in an information retrieval task is also measured before and after applying normalization. The results show that the percentage reduction in index size is better for the proposed stemmer before normalization.

The most widely accepted measure for evaluating a stemmer within an IR system is Mean Average Precision (MAP) over the relevant retrieved documents as a fraction of the retrieved documents. The effectiveness of the proposed stemmer has not been evaluated in terms of MAP due to the unavailability of an IR system and other resources for the Telugu language; this is left as future work.
