ICON-2013 Submission 36


Word Sense Disambiguation in Bengali applied to Bengali-Hindi Machine Translation

Ayan Das (ayandas84@gmail.com) and Sudeshna Sarkar (shudeshna@gmail.com)
Dept. of Computer Science and Engg., Indian Institute of Technology Kharagpur, India

Abstract

We have developed a word sense disambiguation (WSD) system for the Bengali language and applied the system to obtain correct lexical choice in Bengali-Hindi machine translation. We are not aware of any existing system for Bengali WSD. Since there is no sense-annotated Bengali corpus or a sufficient amount of parallel corpus for the Bengali-Hindi language pair, we had to use an unsupervised approach. We use a graph-based method to find sense clusters in Bengali. Following this, we use a vector space based approach to map these sense clusters to Hindi translations of the target word and to predict the translation of the target word in a test instance. We used monolingual Bengali and Hindi corpora, the available Bengali and Hindi wordnets and a bilingual sense dictionary.

1 Introduction

Word Sense Disambiguation (WSD) is the task of finding the correct sense of a word in a context when the word has multiple meanings. The identification of the correct sense of a word in context is useful for many applications including machine translation, information extraction and anaphora resolution.

The various methods for word sense disambiguation can be broadly classified into knowledge/dictionary-based, supervised, semi-supervised and unsupervised approaches. However, there is no strict boundary between these methods, and combinations of them sometimes yield good results.

Our aim was to improve Bengali to Hindi machine translation by incorporating word sense disambiguation. Specifically, we started with a basic Bengali-Hindi rule-based machine translation system in which each polysemous Bengali word is replaced by its most frequent translation as learnt from training data. We are required to find the sense of the word from contextual cues and suggest a suitable translation for the word in Hindi.

Although there is a lot of work on word sense disambiguation and its application to machine translation, there is little work in Indian languages. To the best of our knowledge this is the first attempt to develop a cross-lingual WSD system for Bengali to Hindi machine translation.

Supervised methods of WSD perform much better than unsupervised systems. However, supervised systems rely heavily on knowledge and require a large volume of sense-annotated corpus. Such a corpus is not currently available for Bengali. Given these resource constraints, we needed to develop a WSD system for Bengali. Since it is expensive and time-consuming to develop a sense-annotated corpus, we decided to develop an unsupervised WSD system. The resources that we had are unannotated Bengali and Hindi news corpora, a Bengali sense dictionary and the Hindi WordNet (Chakrabarti et al., 2002). We use an unsupervised graph-based clustering approach for sense clustering and compare our method with two existing graph-based approaches to sense clustering, Navigli and Crisafulli (2010) and Jurgens (2011).

Most work that uses WSD to improve translation is based on bilingual parallel corpora. Since a Bengali-Hindi parallel corpus is not available to us, we propose an approach for predicting the correct translation of polysemous Bengali words in a given context in Bengali to Hindi machine translation using a vector space based model.

The rest of the paper is organized as follows. We discuss related work in Section 2. In Sections 3 and 4 we discuss our objective and some state-of-the-art work on graph-based WSD respectively. In Sections 5 and 6 we discuss our work in detail. Finally, in Section 7 we give a detailed analysis of our results and Section 8 concludes the paper.

2 Related Work

Unsupervised methods have the potential to overcome the lack of a large-scale corpus manually annotated with word senses (Pedersen and Bruce, 1997). Unsupervised approaches can be broadly classified into three subcategories: context clustering, word clustering and graph-based approaches. Graph-based methods have been quite popular since their results are quite promising.

Schütze proposed an unsupervised approach based on the notion of context clustering (Schütze, 1998). The approach is based on the vector space model. Each
word is represented as a vector based on cooccurrence. The word vectors are clustered into groups based on cosine similarity, and each group is considered as identifying a sense of the target word.

Lin (1998) proposed another method based on clustering semantically similar words. The construction of a cooccurrence graph based on grammatical relations between words in context was described by Widdows and Dorow (2002). The adjacency matrix of the graph is interpreted as a Markov chain, and the Markov clustering algorithm is used to identify the word senses. Véronis (2004) proposed a graph-based approach called HyperLex. Some of the recent work on unsupervised graph-based word sense disambiguation is that of Navigli and Crisafulli (2010) and Jurgens (2011). Navigli and Crisafulli attempt to identify connected components in the local graph built from the contexts retrieved for an ambiguous query word. The graph-based algorithm exploits cycles of size 3 and 4 in the co-occurrence graph of the input query to detect the query word's meanings.

Jurgens (2011) agglomeratively clustered the edges of the cooccurrence graph constructed from the corpus, based on a similarity score between the edges, using the single-link criterion. The dendrogram was cut at the level at which a function of the edge density of the clusters is maximum. Lee and Ng (2002) and Voorhees (1993) developed unsupervised WSD algorithms that use other knowledge resources such as dictionaries.

As already mentioned, a major application of WSD is improving the performance of machine translation systems. Brown et al. (1991) developed a method to predict the translation of an ambiguous French word in English, based on the assumption that an ambiguous word in French may be translated into different English words depending on the sense in which it is used. Apidianaki (2009) reports an unsupervised, lexical-selection-based approach that exploits the results of data-driven sense induction from parallel corpora for cross-lingual WSD in English-Greek and Greek-to-English machine translation. Chen et al. (2010) proposed a vector space model based algorithm to compute sense similarity between lexical units (words, phrases, rules, etc.) and used it in statistical machine translation. Vintar et al. (2012) reported a series of experiments performed for the English-Slovene language pair using UKB, a freely available graph-based WSD system. Apidianaki et al. (2012) proposed a method of integrating semantic information at two stages of the translation process of an SMT system.

In Indian languages, Khapra et al. introduced an iterative algorithm, iWSD (Khapra et al., 2011b; Khapra et al., 2009; Mitesh M. Khapra and Bhattacharyya, 2010), which uses the WordNet. The monosemous words are initially tagged with their meanings and used as a seed set. The senses of the words in a sentence are then resolved iteratively, ordered by increasing degree of polysemy. Mitesh M. Khapra and Bhattacharyya (2010) applied the iWSD algorithm to data from the health and tourism domains, using WordNet-based features such as conceptual distance, semantic graph distance, belongingness to the dominant concept, sense distribution and corpus co-occurrence.

Khapra et al. (2011b), Khapra et al. (2011a) and Mitesh M. Khapra and Sharma (2008) attempted domain-specific bilingual WSD using bilingual bootstrapping. A persistent problem was that the words onto which the parameters were projected (Mitesh M. Khapra and Bhattacharyya, 2009) were sometimes themselves polysemous. An Expectation-Maximization (EM) approach was used to settle on fixed parameter values by projecting parameters from one language to the other at each alternate iteration of the EM procedure.

3 Objective

Most work on the application of WSD to machine translation is based on the statistical machine translation (SMT) framework and uses a parallel corpus. Our work aims to use WSD in a rule-based machine translation system for resource-poor language pairs where a parallel corpus is not available. From the literature, we observed that supervised methods perform much better than unsupervised methods. However, supervised methods require a large volume of sense-annotated corpus. Due to the lack of such a sense-annotated corpus for Bengali, and the time and effort involved in creating one, we decided to use an unsupervised method for WSD and an unsupervised method to map sense clusters in the source language to appropriate translations in the target language.

4 Some existing approaches to WSD implemented for Bengali

Graph-based approaches have been quite successful in unsupervised word sense disambiguation, so we decided to work on a graph-based WSD system for Bengali. We have studied the performance on Bengali of two successful WSD methods, those of Navigli and Crisafulli (2010) and Jurgens (2011). These are reported in 4.1 and 4.2 respectively.

4.1 Navigli and Crisafulli's approach

In this approach (Navigli and Crisafulli, 2010), the corpus is queried with a target word and a cooccurrence graph is constructed from the retrieved contexts after removing the target word from them. The main idea behind this approach is that edges in the cooccurrence graph that participate in cycles are likely to connect vertices (i.e. words) belonging to the same meaning component. The work focuses on cycles of length 3 (triangle) and 4 (square). The edge weights are equal to the Dice coefficient of cooccurrence of the two words in the retrieved context set, and edges with weight below a threshold are removed. Each of the remaining edges is then assigned a weight equal to the triangle and square scores, and edges with a score below a threshold value are removed. Finally, all the connected components of size greater than a threshold are identified, and each such component is assumed to contain words that together indicate a distinct sense of the target word.
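To make the Dice weighting and triangle scoring described above concrete, here is a minimal sketch. It is an illustration of the technique, not the authors' implementation; the `min_weight` threshold value and the dict-of-sets graph representation are assumptions.

```python
from itertools import combinations

def dice_weighted_graph(contexts, min_weight=0.2):
    """Build a cooccurrence graph over the words of `contexts` (iterable of
    word sets), weight each edge by the Dice coefficient of the two words'
    cooccurrence, and drop edges below `min_weight` (illustrative value)."""
    freq, pair_freq = {}, {}
    for ctx in contexts:
        words = set(ctx)
        for w in words:
            freq[w] = freq.get(w, 0) + 1
        for a, b in combinations(sorted(words), 2):
            pair_freq[(a, b)] = pair_freq.get((a, b), 0) + 1
    graph = {}
    for (a, b), f_ab in pair_freq.items():
        dice = 2.0 * f_ab / (freq[a] + freq[b])
        if dice >= min_weight:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def triangle_score(graph, a, b):
    """Number of length-3 cycles (triangles) the edge (a, b) takes part in,
    i.e. the number of common neighbours of its two endpoints."""
    return len(graph.get(a, set()) & graph.get(b, set()))
```

The square (length-4 cycle) score can be computed analogously by counting two-step paths between the endpoints that avoid the edge itself.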
On implementing this method for Bengali, we observed that the value of the cut-off (triangle/square) scores in the Navigli and Crisafulli (2010) approach varies widely across target words, depending on the volume of data retrieved by querying the corpus. Secondly, the connected components obtained by this method are non-overlapping; hence any word can occur in at most a single cluster, whereas some words may be indicators of multiple senses of the target polysemous word. Although the overall system performs better with the square score, we observed that for some words the triangle score performs better.

4.2 Jurgens' approach

Jurgens (2011) proposed a community detection algorithm over a cooccurrence graph constructed from the nouns in the corpus that occur with frequency greater than a threshold. Initially, the similarity between each pair of edges is computed by a scoring function which equals zero if the edges do not share a vertex, and otherwise is the ratio of the number of common neighbors to the total number of neighbors of the two vertices other than the common vertex. The edges are then agglomeratively clustered by the single-link criterion. The construction of the dendrogram is stopped when the sum of the edge densities of the clusters is highest.

We observed that the graph becomes excessively large when the whole corpus is used for graph construction; it consumes a lot of memory, and the clustering process becomes extremely slow and time-consuming. To make the algorithm more scalable, we introduced a modification: keeping the algorithm and the scoring function the same, we executed the algorithm on the cooccurrence graph constructed from the contexts retrieved by querying the corpus with the target word (with the target word removed from the contexts), instead of building the graph from the whole corpus.

We also observed that some clusters are highly overlapping and that the number of clusters formed for a target word is very large. Moreover, it is not possible to control the degree of overlap between the clusters.

5 Our Approach to WSD based on community detection

5.1 Motivation

As discussed in Sections 4.1 and 4.2, both approaches have certain disadvantages, and our aim was to find a method that addresses these issues. We wish to find an algorithm that generates relatively few clusters while keeping the degree of overlap among the clusters low. We decided to use an algorithm that focuses on the edge density of a graph and identifies clusters within the graph in which the words (vertices) are more strongly connected amongst themselves than with the vertices outside the community. Our intuition was that a set of words that cooccur very strongly tends to form a clique or a very dense subgraph within the cooccurrence graph, where the edge density within these subgraphs is much higher than the edge density between any two such clusters, i.e., the internal edge density of such a community should be greater than its external edge density. We can map this concept to WSD as follows: if a word is a strong indicator of a particular sense of a target ambiguous word, then it should have more neighbors in the corresponding community than in other communities, and the addition of that word to the community is expected to increase the value of the community's fitness function, discussed in 5.2. Based on these observations, we propose a community detection based approach.

5.2 Overview

In our approach, we treat WSD as a community detection problem. We extract the contexts containing a target word from the Bengali corpus and build a cooccurrence graph, in which a community detection algorithm is used to detect communities. In order to extract the sense-specific information corresponding to a target word, we hypothesize that each community in the co-occurrence graph provides contextual information that indicates a single sense of the target word. In the co-occurrence graph, each vertex represents a word in the corpus, and an edge exists between two vertices if the two words co-occur in a sentence (a context). The weight of an edge is equal to the number of contexts in which the two words co-occur. We studied some of the existing community detection algorithms and decided to use the Greedy Clique Expansion (GCE) algorithm (Lee et al., 2010). The summary of the steps of the approach for word clustering is given in Algorithm 1.

We now describe Lee et al. (2010)'s algorithm. The community detection algorithm is fundamental to finding the contextual information for sense induction. The input to this algorithm is the graph and four parameters: the minimum clique size k, the scaling parameter α of the fitness function, the minimum overlap degree of initial cliques, and the minimum overlap degree ε of final communities. We used the same fitness function and seed detection algorithm as in Lee et al. (2010), and we use their notation to explain the main steps of the algorithm.
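The cooccurrence graph that Algorithm 1 assembles in steps 1-4 can be sketched as follows. This is illustrative only: the two threshold values are assumptions, and the part-of-speech filter that keeps only nouns is abstracted away (the sketch keeps every sufficiently frequent word).

```python
from itertools import combinations
from collections import Counter

def build_cooccurrence_graph(contexts, target, min_freq=2, min_weight=2):
    """Sketch of Algorithm 1, steps 1-4: given sentences (token lists) that
    contain `target`, drop the target word, keep frequent words (the paper
    keeps only nouns), and link two words with weight equal to the number
    of sentences in which they co-occur. Thresholds are illustrative."""
    # Step 2: remove the target word and count the remaining words.
    cleaned = [set(sentence) - {target} for sentence in contexts]
    freq = Counter(w for words in cleaned for w in words)
    kept = {w for w, f in freq.items() if f >= min_freq}
    # Step 3: edge weight = number of sentences where both words occur.
    weight = Counter()
    for words in cleaned:
        for a, b in combinations(sorted(words & kept), 2):
            weight[(a, b)] += 1
    # Step 4: keep only sufficiently heavy edges, as an adjacency map.
    graph = {}
    for (a, b), w in weight.items():
        if w >= min_weight:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph
```

The resulting adjacency map is the input on which community detection (step 5) operates.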
Word               Sense 1            Sense 2                 Sense 3             Sense 4
আচার (achar)       Ritual             Pickle
অর্থ (artha)        Money              Meaning
চাল (chaal)        Rice               Maneuver                Roof
ডাল (daal)         Branch of a tree   Pulses
গোলা (gola)        Shell/Cannon ball  Place to store harvest
জাল (jaal)         Forge              Net                     Trap                Network
কেন্দ্র (kendra)     Center             Regarding               Central Government  Place intended for some purpose
লক্ষ্য (lakshya)     Aim                Purpose                 Observation
প্রণালী (pranaali)   Recipe             Strait
রাস্তা (raasta)      Road               Method                  Option

Table 1: Glosses of Bengali words on which we tested our system

Algorithm 1: Steps followed in word clustering
input : Bengali corpus, query word
output: Bengali word clusters with respect to the query word
1 Query the Bengali corpus with the target word to retrieve the sentences containing it;
2 From the retrieved sentences, remove the target word and select the nouns with frequency above a threshold value;
3 Build the co-occurrence graph over the words selected in the previous step, where the edge weight between two nodes is the number of times the two words occur together in a sentence;
4 Set a threshold for the edge weight and remove the edges with weight below that threshold;
5 Perform community detection on the graph. We use the GCE (Lee et al., 2010) implementation from https://sites.google.com/site/greedycliqueexpansion;

The community fitness function is a measure of the degree to which an induced subgraph S of the graph G corresponds to the notion of a community. It takes an induced subgraph S of G as input and returns a real-valued fitness as output. The fitness of a community S is defined in terms of its internal degree k_in, equal to twice the number of edges that both start and end in S, and its external degree k_out, the number of edges with exactly one end in S:

    F_S = k_in / (k_in + k_out)^α

where α is a parameter that can be tuned.

Let S be an induced subgraph of the graph G which is a seed or core of a community C. In other words, S is embedded in some larger community C, such that all of its nodes are part of C but not all nodes in C are included in S. S has to be expanded by adding nodes until it includes all nodes in C. The technique is summarized as follows:

• For each node v in the boundary of the seed community S, the extent to which the inclusion of v increases or decreases the fitness of S is computed.

• The node with the highest fitness value, vmax, is selected.

• If vmax has a positive fitness value, it is added to S and we loop back to step 1; else, stop and return S.

The overall algorithm proceeds as follows:

1. Find the seeds (maximal cliques of size at least k) in the graph.

2. Choose the largest unexpanded seed, create a candidate community C′, and continue expanding the seed with the community fitness function F until the addition of any node would lower the fitness.

3. If C′ is within ε of any already accepted community C, then C′ is discarded as a near duplicate of C. Otherwise, if no near duplicates are found, C′ is accepted.

4. Loop back to step 2 until no unexpanded seed remains.

6 Going from WSD to word selection for Bengali to Hindi translation

In Section 5 we looked at methods for finding sense clusters in Bengali. In phase 2, we find a mapping to a Hindi word for each sense cluster in Bengali and predict the translation of a polysemous Bengali word in Hindi in a given context.

The second phase proceeds in two stages. First, the clusters obtained from the cooccurrence graphs using the algorithms described so far are mapped to some Hindi translation of the target Bengali word; second, these tagged clusters are in turn used to suggest a suitable translation of an occurrence of the target word in a test context.

From the literature, we observed that most of the work on applying WSD to improve machine translation uses parallel corpora or wordnet relations to identify a suitable translation of a polysemous word in the target language. Since we do not have Bengali-Hindi parallel corpora, we propose a vector space based approach that uses the clusters and comparable Bengali-Hindi corpora to predict the correct translation of a polysemous Bengali word in Hindi. The steps are described below.

6.1 Step 1: Construct the reference vectors

For a Bengali target word, the identifiers of all the concepts (synids) in which the word occurs in the Bengali sense dictionary are extracted.
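As an aside before completing phase 2, the community fitness F_S and the greedy seed expansion of Section 5 can be made concrete. This is an illustrative sketch over a plain dict-of-sets graph, without the clique seeding or overlap check of the full GCE algorithm.

```python
def fitness(graph, community, alpha=1.0):
    """Community fitness F_S = k_in / (k_in + k_out)**alpha, where k_in is
    twice the number of internal edges and k_out is the number of edges
    with exactly one end in the community."""
    k_in = k_out = 0
    for v in community:
        for nbr in graph.get(v, set()):
            if nbr in community:
                k_in += 1      # each internal edge is seen from both ends
            else:
                k_out += 1
    return k_in / (k_in + k_out) ** alpha if k_in + k_out else 0.0

def expand_seed(graph, seed, alpha=1.0):
    """Greedily grow a seed community: repeatedly add the boundary node that
    most increases the fitness, and stop when no addition helps."""
    community = set(seed)
    while True:
        boundary = {n for v in community for n in graph.get(v, set())} - community
        base = fitness(graph, community, alpha)
        best, best_gain = None, 0.0
        for n in boundary:
            gain = fitness(graph, community | {n}, alpha) - base
            if gain > best_gain:
                best, best_gain = n, gain
        if best is None:
            return community
        community.add(best)
```

On a graph consisting of a dense cluster with a sparsely connected fringe, the expansion stops at the cluster boundary, which is exactly the behaviour the sense clusters rely on.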
There exists a one-to-one mapping between the concept ids in the Bengali sense dictionary and the Hindi wordnet. This feature is utilized to get the synsets from the Hindi wordnet corresponding to the sense ids so obtained. All the words in these Hindi synsets are combined to form a set of unique Hindi words, which is expected to be the set of all possible translations of the target Bengali word in Hindi. For every Hindi word in the set, the sentences containing that word are extracted from the Hindi corpus. All such sets of sentences are combined to form a bag of words. The set of all words that occur in the bag of words is used to define a vector space in which each Hindi word corresponds to a dimension. For each word w in the set of possible Hindi translations of the target word, we define a reference vector such that the magnitude in any direction is equal to the tf-idf score of the word corresponding to that dimension in the contexts containing w.

6.2 Step 2: Labeling the clusters with Hindi candidate words

The steps in tagging the Bengali sense clusters in Hindi are given in Algorithm 2.

Algorithm 2: Labelling the clusters with Hindi words
input : Word clusters corresponding to the Bengali query word, Bengali-Hindi bilingual dictionary
output: Assignment of a Hindi word to each cluster
1 Translate the Bengali clusters obtained from the community detection algorithm to Hindi using a bilingual dictionary;
2 Project each cluster onto the vector space defined in 6.1;
3 Find the cosine similarity of each cluster with the reference vectors;
4 Label each cluster with the Hindi translation whose reference vector has the highest cosine similarity with the cluster;

6.3 Step 3: Prediction of translation of target word using the clusters

The steps involved in predicting the Hindi translation of a target Bengali word from the clusters are listed in Algorithm 3.

Algorithm 3: Prediction of translation of target word using the clusters
input : Target word, test sentence, Bengali word clusters tagged with the corresponding Hindi word
output: Translation of the target word in the test context
1 Convert each word in the new test context into its root form;
2 Extract only the nouns from the context;
3 Generate a vector space such that each of the unique words occurring in any of the Bengali clusters obtained for the target Bengali word is a dimension;
4 Project the clusters onto the vector space to form reference vectors;
5 Project the test instance onto the vector space to form the test vector;
6 Find the cluster that has the highest cosine similarity with the test vector;
7 Return the label of the selected cluster as the Hindi translation of the target Bengali word in the test context;

6.4 An example of the vector space model based approach

We give a small example to show how the vector space approach works. Let the target word be চাল (chaal). The word is contained in the synsets 6303 - ধানের বীজ থেকে প্রাপ্ত খাদ্যশস্য (dhaner bij theke prapto khadya shasya) [food grain obtained from paddy seeds] and 6132 - প্যাঁচ, কৌশল, চাল (pyanch, koushal, chaal) [tactics, maneuver, move] in the Bengali sense dictionary. When the Hindi wordnet is queried with these two synids, it returns the synsets containing the words चावल (chaval) [rice] and चाल (chaal) [tactics] respectively. Thus, these two words are considered as the possible translations of the Bengali word.

A mapping is available between the synset/concept identifiers in the Hindi Wordnet and the Bengali sense dictionary. This feature helps us to automate the process of finding the set of all possible translations of a Bengali word in Hindi. Given a Bengali word, we first search the Bengali sense dictionary for the synset ids of all the synsets which contain the word. We then use these sense ids to query the Hindi Wordnet for all the corresponding synsets. The set of all the unique Hindi words contained in the Hindi synsets corresponding to the synset ids is considered to be the set of possible Hindi translations of the target Bengali word.

If the mapping between the two wordnets were not available, one would need to align the two sets of synsets to find the set of Hindi translations of the Bengali words, and vice versa.

When the Hindi corpus is searched with the two words we get the following sentences: भारत में इस साल चावल का उत्पादन कम है (Bharat mein is sal chaval ka utpadana kama hai) [The yield of rice is less in India this year], चावल से काफी सारे पकवान बनते है (Chavala se kafi sare pakavan bante hai) [A lot of dishes are prepared from rice] and मंत्री का चाल जनता समझ रहा है (Mantri ka chal janata samajh raha hai) [The public is able to look through the political maneuver of the
minister]. Let the Bengali test sentence be এই বছর চাল উৎপাদন বড় ভালো হয়েছে (Ei bochor chaal utpadon boro valo hoyechhe) [The yield of rice is very good this year]. The translation of the test instance to Hindi is इस साल चावल उत्पादन बड़ा अच्छा हुआ (Is sala chaval ka utpadana bada achha hua).

The Hindi vectors are given in Table 3. Reference vectors 1 and 2 are combined to form the vector for चावल (chaval) [rice], and vector 3 is the reference vector for चाल (chaal) [tactics], as shown in Table 4. The cosine similarity of the test vector turns out to be greater with the reference vector for चावल (chaval) [rice]. Hence, the translation चावल (chaval) [rice] is predicted for চাল (chaal) in the given test instance.

6.5 Analysis of results

We analyzed the data and the results so obtained and found that the definite senses of most of the ambiguous words could be identified from the contextual information with a significantly high degree of accuracy. However, the accuracy of identification of the abstract senses of most of the words was quite low. For example, the word অর্থ (artha) in Bengali has the glosses money and meaning, and the word কেন্দ্র (kendra) has the glosses center, regarding, central government and place intended for some purpose (health centre or power plant). It was observed that the sense meaning of the word অর্থ (artha) and the sense regarding of the word কেন্দ্র (kendra) could not be identified from the contextual information, and the performance was poor for these words. But we found that these senses can be distinguished by looking at their context patterns. There exist definite patterns, such as certain patterns of verbs and inflectional forms of the words occurring in the neighborhood of the target word in a given context, that can be used for identification of these senses for many of the abstract concepts.

The results of the first phase of the experiment (i.e., the prediction of the translation from the clusters) are mixed. The poor performance of the system for some words is either due to insufficient training data for a word or because the word has some abstract senses that cannot be captured by the contextual nouns. In order to alleviate the problem due to abstract senses we introduce a rule-based post-processing step, which we discuss in Section 6.6.

6.6 Rule-based correction

We identified some of these patterns from the data and defined some rules based on them. These rules were encoded into a program which was executed as a post-processing step to reassign the Hindi translation to the instances of the target Bengali word in a given set of test contexts. The post-processing step was executed on the entire test data, since we had to ensure that the corrective step assigns the correct translation to the misclassified instances and leaves the correctly classified instances unchanged.

The target words for which rules were defined and the corresponding rules are given below.

অর্থ (artha)

Some of the most frequent words that co-occur with the word, or bear similar meaning, when it is used in the "money" sense are: টাকা, ব্যাংক, পরিমাণ, ঋণ, কমিটি, বাজেট, ক্ষেত্র, সময়, কেন্দ্র, মন্ত্রক, লগ্নি, বাণিজ্য etc. (taka, bank, pariman, rin, committee, budget, kshetra, samay, kendra, mantrak, lagni, banijya) [money, bank, quantity, debt, committee, budget, area, time, center, ministry, investment, trade].

However, when the word is used to indicate the sense meaning, it is difficult to identify the sense from contextual cues. We found that there is no specific set of words that could be used to identify this sense, due to the wide variation and low frequency of the words that co-occur with the target word when it is used in the sense meaning, e.g.,

(1) উর্দুতে মালিকা নামের অর্থ রানি। (Urdute malika namer artha rani) [The Urdu word malika means queen]
(2) সার প্রয়োগ না করার অর্থ ফলন কমা। (sar proyog na korar artha folon koma) [Not using fertilizer implies reduction in yield]

However, we found that it follows a general pattern of "meaning of something", i.e., a possessive sense. The word preceding the target word has a -র (-r) inflection attached to the root word, and in some cases one of the qualifying words কোনো (kono) [any], কোনও (konoo) [any], গভীর (gaveer) [deep], নানান (nanan) [multiple], নানাবিধ (nanabidho) [multiple] preceding অর্থ (artha) qualifies it. In such cases, we have to look for the possessive inflection in the noun preceding the qualifying word. Hence, we define the rules as follows:

(1) Check the word preceding the word অর্থ (artha); if it is contained in the set of qualifying words, then take the word preceding the qualifying word. Else take the word immediately preceding অর্থ (artha).
(2) If the selected word has a -র (-r) inflection attached to the root word, then tag the word অর্থ (artha) as मतलब (matlab) [meaning].

লক্ষ্য (lakshya)

We found that the distribution of a specific set of words in the neighborhood of the word লক্ষ্য (lakshya), when used in the first two senses, is rather consistent, but when used in the sense "observation" it suffers from the same problem as the word অর্থ (artha).

বহুদলীয় গণতন্ত্রের মাধ্যমে দেশের নিপীড়িত জনগণের শোষণ মুক্তি ঘটানোই পার্টির লক্ষ্য। (Bahudaliyo ganotantrer madhyame desher nipirito janoganer shoshan mukti ghotanoi partir lakshya) [The main purpose of the party is the liberation of the oppressed people through democracy] (Purpose)
Word               Navigli's approach (%)   Jurgens' approach (%)   GCE community detection algorithm (%)
আচার (achar)       46                       71                      86
অর্থ (artha)        54                       35                      48
চাল (chaal)        36                       51                      54
ডাল (daal)         82                       66                      87
গোলা (gola)        68                       37.5                    59
জাল (jaal)         96                       47.3                    54
কেন্দ্র (kendra)     22                       22                      47.8
লক্ষ্য (lakshya)     16                       19.6                    41
প্রণালী (pranaali)   75                       30                      71
রাস্তা (raasta)      43                       44                      51.5
Average Accuracy   53.8                     42.34                   60

Table 2: Results of prediction of Hindi translation of Bengali words by unsupervised graph-based approaches

Tokens                  Bharat mein yaha sal chaval ka utpadana kama hai se kafi sara pakavan bana mantri chal janata samajh raha
Ref 1                   1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
Ref 2                   0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 0
Ref 3                   0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 1 1 1
Test (bilingual dict)   0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

Table 3: Sentences projected onto the vector space

দুষ্কৃতীদের লক্ষ্য বাপ্পা এক সময় প্রোমোটারি করতেন। (Dushkritider lakshya Bappa ek samay promotari korten) [Bappa, who has been the target of the anti-socials, was involved in real estate business] (Target)

আতাপুর গ্রামের বাসিন্দা কমল সর্দার প্রথম তালতলা ঘাটের কাছে নদীবাঁধে ফাটল লক্ষ্য করেন। (Aatapur gramer basinda Kamal Sardar prothom taltola ghater kache nodibandhe fatol lakshya koren) [Kamal Sardar, a resident of Atapur village, first saw a breach in the dam near the river bank at Taltala] (Observe)

The patterns identified in this case are as follows. If the word succeeding লক্ষ্য (lakshya) has the root রাখ (rakh) with any inflection other than -এ (-e), then it implies the sense "observation".

Similarly, if the word following লক্ষ্য (lakshya) has the root কর (kar), then it may imply either "aim" or "observe" depending on usage. The form 'কর + -e' has two orthographic forms:

1. non-finite form.
2. finite form (present, 3rd person).

The non-finite form following লক্ষ্য (lakshya) corresponds to the meaning "aim". Hence, to capture the "observe" sense we define the rules as:

• If the word succeeding লক্ষ্য (lakshya) has the root কর (kar) and the inflection is other than "-e", then translate লক্ষ্য (lakshya) as "observe".

• If the inflection is "-e" and the part of speech is a finite verb, then mark the sense of লক্ষ্য (lakshya) as "observe".

We define the rule as: get the word immediately after the target word and obtain its inflectional and root form; if the words follow the patterns described, then translate the word to the Hindi word देख (dekh).

চাল (chaal)

We observed that in the corpus the "rice" sense is the dominant sense for the word চাল (chaal). Some of the most frequent words that co-occur with the word, or bear similar meaning, when it is used in the rice sense are: রেশন, জেলা, খাদ্য, দর, বরাদ্দ, বিক্রি, কুইন্টাল, দফতর, টন, দাম, সংগ্রহ, কিলো, কিলোগ্রাম etc. (ration, jela, khadya, dor, boraddo, bikri, quintal, doftor, ton, daam, songroho, kilo, kilogram) [ration, district, food, price, allotment, sale, quintal, office, ton, price, collection, kilo, kilogram].

We observed that:
(1) distinct clusters were formed for the senses rice and roof, but we did not find any cluster that corresponds to the maneuver sense of the target word;
(2) although two distinct clusters containing the indicator words were formed for the roof sense of the target word, no suitable Hindi word was suggested for these clusters.

To address the first problem we identified some patterns in the data for the maneuver sense of the word চাল (chaal). The identified patterns are as follows:

1. The root forms of the words preceding চাল (chaal) contain the words পাল্টা, মোক্ষম, রাজনীতি, আইন (palta, moksham, rajniti, ain) [return, cogent, politics, law].

2. The root of the word immediately af-
Tokens | Bharat mein yaha sal chaval ka utpadana kama hai se kafi sara pakavan bana mantri chal janata samajh raha
r1 | 1 1 1 1 2 1 1 1 2 1 1 1 1 1 0 0 0 0 0
r2 | 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 1 1 1
Test Bilingual Dict (tbd) | 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

r1: reference vector corresponding to चावल; r2: reference vector corresponding to चाल
Table 4: Final reference and test vectors. r1 and r2 are the reference vectors
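The prediction step illustrated in Tables 3 and 4 reduces to comparing the test vector against each reference vector and choosing the closest Hindi translation. A minimal sketch, assuming cosine similarity as the scoring function (the paper does not name the exact similarity measure used):

```python
import math

# Sketch of phase-2 translation prediction over the shared Hindi
# token space of Tables 3 and 4. Cosine similarity is an assumption;
# any vector similarity measure could be substituted.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_translation(test_vec, reference_vecs):
    """Return the Hindi translation whose reference vector lies
    closest to the test vector built via the bilingual dictionary."""
    return max(reference_vecs,
               key=lambda w: cosine(test_vec, reference_vecs[w]))

# r1 (chaval) and r2 (chal) from Table 4, over the 19 Hindi tokens.
refs = {
    "chaval": [1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "chal":   [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
}
test_vec = [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
prediction = predict_translation(test_vec, refs)
```

For the test vector of Table 4, all three non-zero entries overlap with r1 and none with r2, so the prediction is चावल (chaval).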

after চাল (chaal) contains the words চাল, হিসাব, খেল (chaal, hisab, khel) [move, calculation, play].
3. The words দাবা, জুয়া (daba, juya) [chess, gamble] occur within a window of 4 words preceding the target word.

For a given test instance, if at least one of the conditions is satisfied, then the target word is translated to चाल (chaal).

The second problem is due to the absence of any concept (synset id) corresponding to the roof sense of the word চাল (chaal) in the Hindi WordNet. We observed that some of the clusters in Bengali contain words that clearly indicate the roof sense of the word. Manually labelling these clusters with the word छत (chat) [roof] shows some further improvement in performance.

কেন্দ্র (kendra)

Among the senses listed in Table 1, the sense issue cannot be clearly identified from the contextual information. However, it was found that the word করে (kore) immediately follows the word কেন্দ্র (kendra) when it is used in the sense issue. As a post-processing step, we checked the word immediately following the word কেন্দ্র (kendra) in the context. If that word is করে (kore), we identified its sense as that of issue. We got a substantial improvement in performance after incorporating this corrective step.

6.6.1 Results of translation prediction after rule-based correction

The results of the translation prediction after rule-based correction are summarized in Table 5.

7 Experiment and Result

7.1 Context Refinement

The FIRE 2011 [http://www.isical.ac.in/ fire] Bengali and Hindi news corpora were used for the experiments. For the construction of the co-occurrence graph, the words were converted to their root forms and only the nouns were retained for further processing. We wanted to test our system on a small amount of data. We used the most frequent nouns; the number of nouns ranges between 150 and 300, depending on the volume of data obtained from the corpus for a target query word. The words in the Hindi corpus were also converted to their root forms and were used in the construction of the vector space and of the reference vectors for the translation.

7.2 Parameter Selection

In this section, we discuss the various parameters used for the target words in the algorithms.

Parameters for Navigli and Crisafulli's approach: For some of the words for which the co-occurrence graph is relatively sparse, a threshold of 0.00033 was used. For the other words, the cutoff Dice coefficient value ranges from 0.03 to 0.05. The triangle/square score threshold values range between 0.24 and 0.68. We performed the experiments over a range of values for both the triangle and square scores and report the best results.

Parameters for our community detection approach: We experimented with minimum clique sizes k of 3 and 4, since any three nouns that occur in a sentence form a three-clique. During the expansion of a clique, a new node is added to the community one at a time. In this algorithm, we can control the degree of overlap among the initial (seed) cliques and among the final communities. In our experiments we considered clique overlap values in the range 0.4 to 0.7, and community overlap values between 0.4 and 0.8. We got the best results for a minimum clique size of 3, initial clique overlap degrees in the range 0.4 to 0.6, and community overlap degrees in the range 0.6 to 0.7. However, we observed that the value of the parameter α ranges widely, from 1.0 to 4.5, depending on the edge density of the graph, which in turn is proportional to the number of sentences retrieved for a given target word.

The clusters obtained by this method are tagged with Hindi translations of the target word, and finally the tagged clusters are used to predict the Hindi translation of a target Bengali word in a given test context.

7.3 Result

Due to resource constraints, we evaluated our system on 10 Bengali words. For each target word we considered 100 to 120 sentences for evaluation. In phase 1, our community detection method and the approaches of (Navigli and Crisafulli, 2010) and (Jurgens, 2011) were used to generate the sense clusters. The vector space approach was used in phase 2 for translation prediction on all three outputs of phase 1. In Table 1 we give the glosses of the Bengali polysemous words on which we have tested our system. The results are given in Table 2. In Table 5, we report the improvement in results for some of the words after rule-based correction.
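The pattern-based corrections described above for লক্ষ্য (lakshya) and কেন্দ্র (kendra) can be sketched as a small rule pass. The (root, inflection, pos) token representation is a hypothetical stand-in for whatever morphological analysis produced the root forms; the rules themselves follow the text:

```python
# Sketch of the rule-based correction pass. The (root, inflection,
# pos) analyses are assumed to come from a morphological analyser;
# this representation is hypothetical, not the paper's own format.

def correct_lakshya(next_token):
    """Disambiguate lakshya from the token that follows it."""
    root, inflection, pos = next_token
    # Root 'rakh' with any inflection other than '-e': observation.
    if root == "rakh" and inflection != "-e":
        return "observe"
    # Root 'kar' with an inflection other than '-e': observe.
    if root == "kar" and inflection != "-e":
        return "observe"
    # 'kar + -e' is ambiguous: the finite (present, 3rd person)
    # form means observe, the non-finite form means aim.
    if root == "kar" and inflection == "-e":
        return "observe" if pos == "finite_verb" else "aim"
    return None  # no rule fired; keep the phase-1 prediction

def correct_kendra(next_word):
    """The word 'kore' right after kendra signals the issue sense."""
    return "issue" if next_word == "kore" else None
```

When no rule fires, the phase-1 prediction is kept unchanged, matching the paper's use of these rules purely as a post-processing step.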
Results of prediction of Hindi translation

Word | Navigli's (%) before | Jurgens' (%) before | GCE (%) before | Navigli's (%) after | Jurgens' (%) after | GCE (%) after
অর্থ (artha) | 54 | 35 | 48 | 83 | 70.83 | 83.3
চাল (chaal) | 36 | 51 | 54 | 47 | 63 | 70.85
কেন্দ্র (kendra) | 22 | 22 | 47.8 | 50.7 | 55.6 | 70
লক্ষ্য (lakshya) | 16 | 19.6 | 41 | 39 | 47.5 | 67
Average accuracy | 32 | 31.9 | 47.7 | 54.93 | 59.23 | 72.79

Table 5: Results of the graph-based approaches for prediction of the Hindi translation of Bengali words before and after rule-based correction. The accuracy for the words অর্থ (artha), চাল (chaal), লক্ষ্য (lakshya) and কেন্দ্র (kendra) improved with the rules

8 Conclusion

We have developed an unsupervised WSD system for Bengali which may be used to improve Bengali-Hindi machine translation. The results of the first phase (graph-based approach followed by a vector space model approach) show that the system performs reasonably well even when a comparable corpus is used instead of a parallel corpus. However, for certain categories of words/senses, the performance is poor. In the second phase (rule-based correction phase) we have shown that, for some abstract senses of certain words for which the performance of the first phase was not good enough, the disambiguation can be done by identifying certain patterns in the inflectional forms of the word itself or of the words in the neighbourhood of the target word. This is an initial attempt to test the effect of manually defined rules on WSD. We need to work further on the generalization of rules for particular classes of words and on the effect of that generalization. Future work may study the effect on an MT system when this module is integrated into it.

References

Eneko Agirre and Oier Lopez de Lacalle. Ubc-alm: Combining k-nn with svd for wsd.

Eneko Agirre, David Martínez, Oier López de Lacalle, and Aitor Soroa. 2006. Two graph-based algorithms for state-of-the-art wsd. In EMNLP '06, pages 585–593, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marianna Apidianaki, Guillaume Wisniewski, Artem Sokolov, Aurélien Max, and François Yvon. 2012. Wsd for n-best reranking and local language modeling in smt. In SSST-6 '12, pages 1–9, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marianna Apidianaki. 2009. Data-driven semantic analysis for multilingual wsd and lexical selection in translation. In EACL, pages 77–85.

A. R. Balamurali, Aditya Joshi, and Pushpak Bhattacharyya. 2011. Harnessing wordnet senses for supervised sentiment classification. In EMNLP '11, pages 1081–1091, Stroudsburg, PA, USA. Association for Computational Linguistics.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1991. Word-sense disambiguation using statistical methods. In ACL '91, pages 264–270.

Debasri Chakrabarti, Dipak Kumar Narayan, Prabhakar Pandey, and Pushpak Bhattacharyya. 2002. Experiences in building the Indo WordNet: a WordNet for Hindi. In Proceedings of the First Global WordNet Conference, 2002.

Boxing Chen, George F. Foster, and Roland Kuhn. 2010. Bilingual sense similarity for statistical machine translation. In ACL, pages 834–843.

David Jurgens. 2011. Word sense induction by community detection. In TextGraphs-6, pages 24–28, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mitesh M. Khapra, Sapan Shah, Piyush Kedia, and Pushpak Bhattacharyya. 2009. Projecting parameters for multilingual word sense disambiguation. In EMNLP '09, pages 459–467, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mitesh M. Khapra, Salil Joshi, and Pushpak Bhattacharyya. 2011a. It takes two to tango: A bilingual unsupervised approach for estimating sense distributions using expectation maximization. In IJCNLP, pages 695–704.

Mitesh M. Khapra, Salil Joshi, Arindam Chatterjee, and Pushpak Bhattacharyya. 2011b. Together we can: bilingual bootstrapping for wsd. In ACL HLT '11, pages 561–569, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yoong Keok Lee and Hwee Tou Ng. 2002. An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation.

C. Lee, F. Reid, A. McDaid, and N. Hurley. 2010. Detecting highly-overlapping community structure by greedy clique expansion. In Workshop - ACM KDD-SNA.
Dekang Lin. 1998. Automatic retrieval and cluster-
ing of similar words. In COLING ’98, pages 768–
774, Stroudsburg, PA, USA. Association for Com-
putational Linguistics.
Piyush Kedia, Mitesh M. Khapra, Sapan Shah and
Pushpak Bhattacharyya. 2010. Domain-specific
word sense disambiguation combining corpus based
and wordnet based parameters. 5th International
Conference on Global Wordnet (GWC 2010), Mumbai.
Saurabh Sohoney, Mitesh M. Khapra, Anup Kulkarni
and Pushpak Bhattacharyya. July 2010. All words
domain adapted wsd: Finding a middle ground be-
tween supervision and unsupervision. In ACL ’10,
Uppsala, Sweden.
Sashank Chauhan, Soumya Nair, Mitesh M. Khapra,
Pushpak Bhattacharyya and Aditya Sharma. 2008.
Domain specific iterative word sense disambiguation
in a multilingual setting. In ICON ’08.
Roberto Navigli and Giuseppe Crisafulli. 2010. Induc-
ing word senses to improve web search result clus-
tering. In EMNLP ’10, pages 116–126, Cambridge,
MA, October. Association for Computational Lin-
guistics.
Hwee Tou Ng and Hian Beng Lee. 1996. Integrating
multiple knowledge sources to disambiguate word
sense: An exemplar-based approach. In ACL ’96,
pages 40–47.
Ted Pedersen and Rebecca Bruce. 1997. Distinguish-
ing word senses in untagged text. In EMNLP ’97,
pages 197–207.
Hinrich Schütze. 1998. Automatic word sense
discrimination. Comput. Linguist., 24(1):97–123,
March.
Jean Véronis. 2004. Hyperlex: lexical cartography
for information retrieval. Computer Speech & Lan-
guage, 18(3):223–252.
Špela Vintar, Darja Fišer, and Aljoša Vrščaj. 2012.
Were the clocks striking or surprising?: using wsd
to improve mt performance. In the Joint Workshop
on ESIRMT and HyTra, EACL 2012, pages 87–92,
Stroudsburg, PA, USA. Association for Computa-
tional Linguistics.
Ellen M. Voorhees. 1993. Using wordnet to disam-
biguate word senses for text retrieval. In SIGIR ’93,
pages 171–180, New York, NY, USA. ACM.
Dominic Widdows and Beate Dorow. 2002. A
graph model for unsupervised lexical acquisition.
In 19th International Conference on Computational
Linguistics, pages 1093–1099.
