ICON-2013 Submission 36
Translation
Ayan Das Sudeshna Sarkar
Dept. of Computer Science and Engg. Dept. of Computer Science and Engg.
Indian Institute of Technology Indian Institute of Technology
Kharagpur, India Kharagpur, India
ayandas84@gmail.com shudeshna@gmail.com
Algorithm 2: Labelling the clusters with Hindi words
input : Word clusters corresponding to the Bengali query word, Bengali-Hindi bilingual dictionary
output: Assignment of a Hindi word to each cluster
1 Translate the Bengali clusters obtained from the community detection algorithm to Hindi using a bilingual dictionary;
2 Project each cluster onto the vector space defined in 6.1;
3 Find the cosine similarity of each cluster with the reference vectors;
4 Label each cluster with the Hindi translation whose reference has the highest cosine similarity with the cluster;

6.3 Step 3: Prediction of translation of target word using the clusters
The steps involved in predicting the Hindi translation of the target Bengali word from the clusters are listed in Algorithm 3.

6.4 An example of vector space model based approach
We give a small example to show how the vector space approach works. Let the target word be চাল(chaal). The word is contained in the synsets 6303 - ধানের বীজ থেকে প্রাপ্ত খাদ্য শস্য(dhaner bij theke prapto khadya shasya)[food grain obtained from paddy seeds] and 6132 - প্যাঁচ, কৌশল, চাল(pyanch, koushal, chaal)[tactics, maneuver, move] in the Bengali sense dictionary. When the Hindi Wordnet is queried with these two synids, it returns the synsets containing the words चावल(chaval)[rice] and चाल(chaal)[tactics] respectively. Thus, these two words are considered as the possible translations of the Bengali word.

A mapping is available between the synset/concept identifiers in the Hindi Wordnet and the Bengali sense dictionary. This feature helps us to automate the process of finding the set of all possible translations of a Bengali word in Hindi. Given a Bengali word, we first searched the Bengali sense dictionary for the synset ids of all the synsets which contain the word. We then used these synset ids to query the Hindi Wordnet for all the corresponding synsets. The set of all the unique Hindi words contained in the Hindi synsets corresponding to the synset ids is considered to be the set of possible Hindi translations of the target Bengali word.
If the mapping between the two wordnets were not available, one would need to align the synsets of the two wordnets to find the set of Hindi translations of the Bengali words and vice versa.
When the Hindi corpus is searched with the two words we get the following sentences: भारत में इस साल चावल का उत्पादन कम है(Bharat mein is sal chaval ka utpadana kama hai)[The yield of rice is less in India this year], चावल से काफी सारे पकवान बनते है(Chavala se kafi sare pakavan bante hai)[A lot of dishes are prepared from rice] and मंत्री का चाल जनता समझ रहा है(Mantri ka chal janata samajh raha hai)[The public is able to look through the political maneuver of the
minister] and let the Bengali test sentence be এই বছর চাল উৎপাদন বড় ভালো হয়েছে(Ei bochor chaal utpadon boro valo hoyechhe)[The yield of rice is very good this year]. The translation of the test instance to Hindi is इस साल चावल उत्पादन बड़ा अच्छा हुआ(Is sala chaval ka utpadana bada achha hua).
The Hindi vectors are as given in Table 3. The reference vectors 1 and 2 are combined to form the vector for चावल(chaval)[rice], and vector 3 is the reference vector for चाल(chaal)[tactics], as shown in Table 4. The cosine similarity of the test vector turns out to be greater with the reference vector for चावल(chaval)[rice]. Hence, the translation चावल(chaval)[rice] is predicted for চাল(chaal) in the given test instance.

6.5 Analysis of results
We analyzed the data and the results and found that the definite senses of most of the ambiguous words could be identified from the contextual information with a high degree of accuracy. However, the accuracy of identification of the abstract senses of most of the words was quite low. For example, the word অর্থ(artha) in Bengali has the glosses "money" and "meaning", and the word কেন্দ্র(kendra) has the glosses "center", "regarding", "central government" and "place intended for some purpose" (health centre or power plant) etc. It was observed that the sense "meaning" of the word অর্থ(artha) and the sense "regarding" of the word কেন্দ্র(kendra) could not be identified from the contextual information, and the performance was poor for these words. However, we found that these senses can be distinguished by looking at their context patterns. Definite patterns, such as certain verb patterns and inflectional forms of the words that occur in the neighborhood of the target word in a given context, can be used to identify these senses for many of the abstract concepts.
The first phase of the experiment (i.e., the prediction of translation from the clusters) gives mixed results. The poor performance of the system for some words is either due to insufficient training data for a word or because the word has some abstract senses that cannot be captured by the contextual nouns. In order to alleviate the problem due to abstract senses we introduce the rule-based post-processing step, which we discuss in section 6.6.

6.6 Rule-based correction
We identified some of these patterns from the data and defined some rules based on these patterns. These rules were encoded into a program which was executed as a post-processing step to reassign the Hindi translation to the instances of the target Bengali word in a given set of test contexts. The post-processing step was executed on the entire test data, since we had to ensure that the corrective step assigns the correct translation to the misclassified instances and leaves the correctly classified instances unchanged.
The target words for which rules were defined and the corresponding rules are given below.

অর্থ(artha)
Some of the most frequent words that co-occur with the word or bear a similar meaning when it is used in the "money" sense are: টাকা, ব্যাঙ্ক, পরিমাণ, ঋণ, কমিটি, বাজেট, ক্ষেত্র, সময়, কেন্দ্র, মন্ত্রক, লগ্নি, বাণিজ্য etc.(taka, bank, pariman, rin, committee, budget, kshetra, samay, kendra, mantrak, lagni, banijya etc.)[money, bank, quantity, debt, committee, budget, area, time, center, ministry, investment, trade]
However, when the word is used in the sense "meaning", it is difficult to identify the sense from contextual cues. We found that there is no specific set of words that could be used to identify this sense. This is due to the wide variation and low frequency of the words that co-occur with the target word when it is used in the sense "meaning", e.g.,
(1) উর্দুতে মালিকা নামের অর্থ রানি.(Urdute malika namer artha rani)[The Urdu word malika means queen]
(2) সার প্রয়োগ না করার অর্থ ফলন কমা.(sar proyog na korar artha folon koma)[Not using fertilizer implies reduction in yield]
However, we found that it follows a general pattern of "meaning of something", i.e., a possessive sense. The word preceding the target word has a -র(-r) inflection attached to the root word, and in some cases one of the qualifying words কোনো(kono)[any], কোনও(konoo)[any], গভীর(gaveer)[deep], নানান(nanan)[multiple], নানাবিধ(nanabidho)[multiple] precedes অর্থ(artha) and qualifies it. In such cases, we have to look for the possessive inflection in the noun preceding the qualifying word. Hence, we define the rules as follows.
(1) Check the word preceding the word অর্থ(artha). If it is contained in the set of qualifying words, then select the word preceding the qualifying word. Otherwise, select the word immediately preceding অর্থ(artha).
(2) If the selected word has a -র(-r) inflection attached to the root word, then tag the word অর্থ(artha) as मतलब(matlab)[meaning].

লক্ষ্য(lakshya)
We found that the distribution of a specific set of words in the neighborhood of the word লক্ষ্য(lakshya) is rather consistent when it is used in the first two senses, but when it is used in the sense "observation" it suffers from the same problem as the word অর্থ(artha).
বহুদলীয় গণতন্ত্রের মাধ্যমে দেশের নিপীড়িত জনগণের শোষণ মুক্তি ঘটানোই পার্টির লক্ষ্য।(Bahudaliyo ganotantrer madhyame desher nipirito janoganer shoshan mukti ghotanoi partir lakshya)[The main purpose of the party is the liberation of the oppressed people through democracy] (Purpose)
Accuracy of prediction of Hindi translation of Bengali words by unsupervised graph-based approaches

Word                Navigli's approach (%)   Jurgens' approach (%)   GCE community detection algorithm (%)
আচার(achar)         46                       71                      86
অর্থ(artha)          54                       35                      48
চাল(chaal)          36                       51                      54
ডাল(daal)           82                       66                      87
গোলা(gola)          68                       37.5                    59
জাল(jaal)           96                       47.3                    54
কেন্দ্র(kendra)       22                       22                      47.8
লক্ষ্য(lakshya)       16                       19.6                    41
প্রণালী(pranaali)     75                       30                      71
রাস্তা(raasta)        43                       44                      51.5
Average Accuracy    53.8                     42.34                   60

Table 2: Results of prediction of Hindi translation of Bengali words by unsupervised graph-based approaches
Token      Ref 1   Ref 2   Ref 3   r1   r2
Bharat     1       0       0       1    0
mein       1       0       0       1    0
yaha       1       0       0       1    0
sal        1       0       0       1    0
chaval     1       1       0       2    0
ka         1       0       1       1    1
utpadana   1       0       0       1    0
kama       1       0       0       1    0
hai        1       1       1       2    1
se         0       1       0       1    0
kafi       0       1       0       1    0
sara       0       1       0       1    0
pakavan    0       1       0       1    0
bana       0       1       0       1    0
mantri     0       0       1       0    1
chal       0       0       1       0    1
janata     0       0       1       0    1
samajh     0       0       1       0    1
raha       0       0       1       0    1

Table 4: Final reference and test vectors. r1 and r2 are the reference vectors.
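
The comparison described in Section 6.4 can be sketched as follows. This is only an illustrative sketch, not the program used in the experiments. The reference vectors r1 and r2 are copied from Table 4. The test vector is an assumption: it is built here from the transliterated test sentence, with "is" mapped to the token "yaha" as the table appears to do, and words outside the table vocabulary are simply dropped. The function names are hypothetical.

```python
import math

# Vocabulary: the tokens of Table 4 (transliterated Hindi).
TOKENS = ["Bharat", "mein", "yaha", "sal", "chaval", "ka", "utpadana", "kama",
          "hai", "se", "kafi", "sara", "pakavan", "bana", "mantri", "chal",
          "janata", "samajh", "raha"]

# Final reference vectors from Table 4.
REFERENCES = {
    "chaval": [1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # r1, rice
    "chaal":  [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],  # r2, tactics
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def vectorize(words):
    """Count vector over TOKENS; out-of-vocabulary words are ignored."""
    return [words.count(token) for token in TOKENS]

def predict(words):
    """Return the translation whose reference vector is closest to the context."""
    vec = vectorize(words)
    return max(REFERENCES, key=lambda label: cosine(vec, REFERENCES[label]))

# Assumed tokenisation of the translated test sentence.
test_words = ["yaha", "sal", "chaval", "ka", "utpadana", "bada", "achha", "hua"]
test_vec = vectorize(test_words)
for label, ref in REFERENCES.items():
    print(label, round(cosine(test_vec, ref), 3))
print("predicted:", predict(test_words))
```

Under these assumptions the test vector is closer to r1 than to r2, which agrees with the prediction of चावल(chaval)[rice] reported in the example.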
ter চাল(chaal) contains the words চাল, হিসাব, খেল(chaal, hisab, khel)[move, calculation, play].
3. The words দাবা, জুয়া(daba, juya)[chess, gamble] occur within a window of 4 words preceding the target word.
For a given test instance, if at least one of the conditions is satisfied, then translate the target word to चाल(chaal).
The second problem is due to the absence of any concept (synid) corresponding to the "roof" sense of the word চাল(chaal) in the Hindi Wordnet. We observed that some of the clusters in Bengali contain words that clearly indicate the "roof" sense of the word. Manually labelling these clusters with the word छत(chat)[roof] shows some further improvement in performance.

কেন্দ্র(kendra)
Among the senses listed in Table 1, the sense "issue" cannot be clearly identified from the contextual information. However, it was found that the word করে(kore) immediately follows the word কেন্দ্র(kendra) when it is used in the sense "issue".
As a post-processing step, we checked the word immediately following the word কেন্দ্র(kendra) in the context. If the word is করে(kore), we identified its sense as that of "issue". We got a substantial improvement in performance after incorporating this corrective step.

6.6.1 Results of translation prediction after rule-based correction
The results of the translation prediction after rule-based correction are summarized in Table 5.

7 Experiment and Result

7.2 Parameter Selection
In this section, we discuss the various parameters used for the target words in the algorithms.

Parameters for Navigli and Crisafulli's approach: For some of the words for which the co-occurrence graph is relatively sparse, a threshold of 0.00033 was used. However, for other words the cutoff Dice coefficient value ranges from 0.03 to 0.05.
The triangle/square score threshold values range from 0.24 to 0.68. We performed the experiments over a range of values with both triangle and square scores and reported the best results.

Parameters for our community detection approach: We experimented with the minimum clique size k equal to 3 and 4, since any three nouns that occur in a sentence form a 3-clique. During expansion of a clique, new nodes are added to the community one at a time.
In this algorithm, we can control the degree of overlap among the initial (seed) cliques and among the final communities. In our experiment we considered clique overlap values in the range of 0.4 to 0.7, and community overlap values between 0.4 and 0.8.
We got the best results for a minimum clique size of 3, an initial clique overlap degree in the range 0.4 to 0.6, and community overlap degree values in the range 0.6 to 0.7. However, we observed that the value of the parameter α ranges widely from 1.0 to 4.5 depending on the edge density of the graph, which in turn is proportional to the number of sentences retrieved for a given target word.
The clusters obtained by this method are tagged with Hindi translations of the target word, and finally the tagged clusters are used to predict the Hindi translation of a target Bengali word in a given test context.
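
The Dice coefficient cutoff mentioned for Navigli and Crisafulli's approach in Section 7.2 can be illustrated with a small sketch. It assumes the standard definition Dice(w1, w2) = 2 c(w1, w2) / (c(w1) + c(w2)), where c counts the sentences containing a word or a word pair. The function name `build_edges`, the toy sentences and the threshold of 0.6 are hypothetical and chosen only so that the effect of the cutoff is visible; they are not the data or the values used in the experiments.

```python
from collections import Counter
from itertools import combinations

def build_edges(sentences, threshold):
    """Keep co-occurrence edges whose Dice coefficient reaches the threshold.

    sentences: list of lists of content words (one list per sentence).
    Returns a dict mapping a sorted word pair to its Dice coefficient.
    """
    word_count = Counter()   # number of sentences containing each word
    pair_count = Counter()   # number of sentences containing each word pair
    for sentence in sentences:
        words = sorted(set(sentence))
        word_count.update(words)
        pair_count.update(combinations(words, 2))

    edges = {}
    for (w1, w2), together in pair_count.items():
        dice = 2.0 * together / (word_count[w1] + word_count[w2])
        if dice >= threshold:
            edges[(w1, w2)] = dice
    return edges

# Hypothetical toy corpus of transliterated content words.
toy_sentences = [
    ["chaval", "utpadana", "sal"],
    ["chaval", "pakavan"],
    ["chal", "mantri", "janata"],
    ["chaval", "utpadana"],
]
edges = build_edges(toy_sentences, threshold=0.6)
for pair, score in sorted(edges.items()):
    print(pair, round(score, 3))
```

A lower threshold keeps more edges, which is consistent with the much smaller cutoff reported for words whose co-occurrence graph is sparse.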
Table 5: Results of graph-based approaches for prediction of Hindi translation of Bengali words before and after rule-based correction. The accuracy for the words অর্থ, চাল, লক্ষ্য and কেন্দ্র improved with the rules.
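
The rules of Section 6.6 for অর্থ(artha) and কেন্দ্র(kendra) can be sketched as a post-processing function. This is a minimal sketch under stated assumptions, not the program used in the experiments: the possessive -র(-r) inflection is approximated by a plain suffix test, whereas a real system would need morphological analysis (root words may also end in র); the label "SENSE_ISSUE" is a hypothetical placeholder, since the Hindi translation for that sense is not given in this section; and the names `correct` and `has_possessive_r` are hypothetical.

```python
ARTHA = "অর্থ"
KENDRA = "কেন্দ্র"
KORE = "করে"
# Qualifying words that may stand between the possessive noun and artha.
QUALIFIERS = {"কোনো", "কোনও", "গভীর", "নানান", "নানাবিধ"}

def has_possessive_r(word):
    """Crude stand-in for detecting the -r inflection (assumption: suffix test)."""
    return word.endswith("র")

def correct(tokens, i, predicted):
    """Return the corrected label for the target word tokens[i].

    The cluster-based prediction is returned unchanged when no rule fires.
    """
    target = tokens[i]
    if target == ARTHA:
        j = i - 1
        # Rule (1): skip a qualifying word, if present.
        if j >= 0 and tokens[j] in QUALIFIERS:
            j -= 1
        # Rule (2): possessive inflection on the selected word.
        if j >= 0 and has_possessive_r(tokens[j]):
            return "मतलब"
    elif target == KENDRA:
        # kendra immediately followed by kore indicates the "issue" sense.
        if i + 1 < len(tokens) and tokens[i + 1] == KORE:
            return "SENSE_ISSUE"
    return predicted

# Example (1) from the text: "Urdute malika namer artha rani".
example = ["উর্দুতে", "মালিকা", "নামের", "অর্থ", "রানি"]
print(correct(example, 3, "CLUSTER_LABEL"))
```

Because the function falls through to the original prediction, instances that match no rule keep the label assigned by the clusters, in line with the requirement that correctly classified instances stay unchanged.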