Measuring Bilingual Corpus Comparability
© Cambridge University Press 2018
doi:10.1017/S1351324917000481
e-mail: eric.gaussier@imag.fr
3 China Electric Power Research Institute, Wuhan, China
e-mail: yangdan3@epri.sgcc.com.cn
Abstract
Comparable corpora serve as an important substitute for parallel resources in cases of under-resourced language pairs. Previous work mostly aims to find better strategies to exploit existing comparable corpora, while ignoring the variety in corpus quality. The quality of comparable corpora strongly affects their usability in practice, a fact that has been confirmed by several studies. However, researchers have not yet established a widely accepted and fully validated framework to measure corpus quality. In this paper, we therefore investigate a comprehensive methodology for assessing the quality of comparable corpora. Specifically, we propose several comparability measures and a quantitative strategy to test those measures. Our experiments show that the proposed comparability measures capture gold-standard comparability levels very well and are robust to the bilingual dictionary used. Moreover, we show in the task of bilingual lexicon extraction that the proposed measure correlates well with the performance of the real-world application.
1 Introduction
A bilingual corpus is an important resource used to cross the language barrier in
multilingual Natural Language Processing (NLP) tasks such as Statistical Machine
Translation (SMT) (Och and Ney 2003; Bahdanau, Cho and Bengio 2015) and
Cross-Language Information Retrieval (CLIR) (Ballesteros and Croft 1997). Parallel corpora, i.e., document collections comprised of texts that are translations of one another, have been broadly used in cross-language NLP tasks. Their availability remains however limited, especially for minority languages (Markantonatou et al. 2006). Publicly available parallel corpora exist only in narrow domains and for a few language pairs. For example, the Europarl corpus (Koehn 2005), widely used in SMT research, was built by retrieving parallel texts from the proceedings of the
† This work was co-supported by the Natural Science Foundation of China (Nos. 61300144 and 61572223), the State Language Commission of China (No. YB125-132), the Humanity and Social Science Foundation of the Ministry of Education of China (No. 15YJC870029) and the Fundamental Research Funds for Central Universities (Nos. CCNU16A06015, CCNU15A05062, CCNU17GF0005, CCNUSZ2017024).
Downloaded from https://www.cambridge.org/core. University of New England, on 03 Mar 2018 at 18:09:25, subject to the Cambridge Core terms of use,
available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S1351324917000481
2 B. Li et al.
European Parliament. Other parallel corpora such as the United Nations corpus and the Hansard corpus1 suffer from the same problem.
As another type of bilingual corpora, a comparable corpus is in general easier
to obtain as the only requirement is that documents cover related content in
different languages. Previous work has shown that comparable corpora can be
successfully used in such applications as bilingual lexicon extraction (for example,
refer to Rapp (1999) and Chebel, Latiri and Gaussier (2017)), enhancement of
SMT systems (Munteanu, Fraser and Marcu 2004; AbduI-Rauf and Schwenk 2009;
Hazem and Morin 2016), enhancement of CLIR systems (Talvensaari et al. 2007),
as well as the modeling of topics across languages (Boyd-Graber and Blei 2009; Ni
et al. 2009). There is thus more and more evidence that comparable corpora can be
used to bridge the language barrier in applications where parallel resources are not
available.
In all these studies, the definition of what is a comparable corpus is rather
vague. For example, Ji (2009) defines a comparable corpus as a text collection
consisting of documents describing similar topics. In Munteanu et al. (2004) and
Hewavitharana and Vogel (2008), a comparable corpus is defined as a text collection
covering overlapping information. Sharoff, Rapp and Zweigenbaum (2013) characterize comparable corpora as a less parallel form of bilingual corpora. Based on discussions in former
literature, we define here comparable corpora as document sets in different languages
that cover similar topics. Comparable corpora can thus be quite different from each
other, depending on the extent to which two monolingual parts of the comparable
corpus are related to each other. Intuitively, a parallel corpus can be seen as a
special case of the comparable corpus, corresponding to the highest comparability
level.2
Data-driven NLP tasks depend heavily on the quality of the resources used: the better the corpus, the better the knowledge one can extract from it.
We thus conjecture that comparable corpora of higher quality3 will yield better
performance of applications relying on them, a fact that has actually been validated
in several previous studies (Li and Gaussier 2010; Skadina et al. 2010). Existing work
mining comparable corpora mostly builds and uses comparable corpora according
to humans’ simple intuitions. For instance, Robitaille et al. (2006) construct a comparable corpus by using a search engine to retrieve web pages highly relevant to a set of queries given in two languages. Munteanu and Marcu (2006) build a comparable corpus by using existing news corpora from the same period and a CLIR system to retrieve related document pairs. These studies are reasonable, but
there is still a risk that one might obtain and use a corpus of poor quality, leading
to unpredictable performance, since there has not been any method one could use
1. Both the United Nations corpus and the Hansard corpus are available from http://www.ldc.upenn.edu.
2. By definition, a parallel corpus could consist of documents on quite different topics/domains. In practice, parallel corpora normally consist of documents from a specific domain.
3. The quality of a multilingual corpus is normally affected by several factors, for instance volume, parallelness and novelty. In this paper, we mostly concentrate on parallelness.
to assess corpus quality in a quantitative way. Without such a measure, one cannot build and use a comparable corpus with full confidence.
The task of measuring text similarity has attracted much attention in the NLP field. The Semantic Textual Similarity (STS) shared task, held as part of SemEval,4 aims to automatically predict the similarity of a pair of sentences. The classic strategies for measuring monolingual text similarity can be extended to cross-language settings (e.g., Mathieu, Besancon and Fluhr (2004), Luong, Pham and Manning (2015)), which is the most direct approach to realizing a comparability measure. The European FP7 project ACCURAT5 has proposed several comparability measures based on this intuition. In addition, there are several other possibilities for investigating comparability. These studies will be discussed in more detail in Section 2. However, previous work has not proposed a systematic approach to measure corpus quality and to test the comparability measures themselves.
We will first establish in this paper measures to capture different comparability levels. The proposed measures will then be examined on gold-standard comparability levels to show their coherence with the gold standards. In addition, the measures will be tested in the task of bilingual lexicon extraction to show their correlation with the performance of real-world tasks. The remainder of the paper is organized as follows:
• In Section 2, we define more precisely the notion of comparability and develop several comparability measures. Several other measures are also considered in this section for comparison purposes.
• The comparability measures are evaluated, in Section 3, in terms of correlation with gold-standard comparability levels and robustness to dictionary coverage. In addition, the comparability measures are used in a practical application, bilingual lexicon extraction, to show that the measures correlate well with usability.
• Several aspects of this study are discussed in Section 4, and the paper is concluded in Section 5.
2 Comparability measures
A fine-grained comparability measure is important when one needs to choose the best among several comparable corpora prepared from different resources, or when one needs to systematically enhance the quality of a given low-quality corpus. We thus develop in this section a quantitative measure to capture
various comparability levels.
As discussed in Section 1, comparable corpora serve as a substitute for parallel resources in the case of under-resourced language pairs. An ideal comparability measure should rely as little as possible on parallel resources, which are expensive to obtain for under-resourced languages. Moreover, computationally cheap measures
are preferred, which can lead to feasible solutions for such tasks as corpus quality
enhancement (Li, Gaussier and Aizawa 2011).
4. http://alt.qcri.org/semeval2015/
5. http://www.accurat-project.eu
There have been only a few works trying to investigate the formal definition or
quantification of the quality of comparable corpora. Such works as Rayson and
Garside (2000) and Kilgarriff (2001) are early attempts to quantify how similar
two corpora in the same language are in terms of lexical content. Sharoff (2007)
investigates automatic ways of differentiating web corpora in terms of domains
and genres. Saralegi, San-Vicente and Gurrutxaga (2008) attempt to measure the
degree of comparability of two corpora in different languages by inferring a global
comparability from the similarity of all cross-language document pairs. This measure
is however computationally infeasible if the corpora contain a large number
of documents. Under the seventh European framework,6 researchers involved in
the project ACCURAT (Skadina et al. 2010) have studied several measures and
metrics for assessing corpus comparability and document parallelism for under-resourced languages. In addition to the above studies devoted to comparable
corpora, researchers have recently developed several cross-lingual models of word
embeddings (Hermann and Blunsom 2014; Luong, Pham and Manning 2015; Vulic
and Moens 2015), which could be used to model cross-lingual semantic similarity.
The word embedding approaches however rely on a training procedure on parallel
or comparable corpora, which is computationally expensive. We will use some of
these measures as baselines for comparison purposes only.
We first introduce below a set of measures that are extensions of the ones proposed
in our former work (Li and Gaussier 2010). All the measures we are going to consider
can be classified into one of the following categories:
6. http://cordis.europa.eu/fp7/
MT approaches use parallel corpora for training, which are rare for under-resourced
language pairs. Cross-lingual word embedding approaches make use of the training
procedure on parallel or comparable corpora, which is computationally expensive.
Those approaches are presented here as baselines only for the purpose of comparison.
For convenience, the following discussions will be made in the context of French–
English comparable corpora.
Comparability measures can be defined on various levels such as sentences,
documents or whole corpora. Comparability measures on the document or sentence
level simply amount to measuring the similarity of sentences/documents across
languages, which is a classic task in NLP (e.g., Luong, Pham and Manning (2015)).
It is not clear whether these classic techniques can be directly used to capture the
differences between various comparable corpora. We intend to define comparability
so as to reflect the usability of the comparable corpus in NLP tasks. It is generally
considered that different types of multilingual corpora display different levels of
comparability. For example, the following comparable corpora have decreasing
comparability levels:7
A good comparability measure should correlate well with the different compar-
ability levels above. In other words, it should be able to capture (fine-grained)
differences on the levels of comparability. We will return to this issue in Section 3.
Prior to that, we first introduce several comparability measures below.
7. A similar list of decreasing comparability levels can be found in Sharoff et al. (2013).
English part C_e. If we consider the translation process from the English part to the French part, the comparability measure M_ef can be defined as the expectation of finding, for each English word w_e in the vocabulary C_e^v of C_e, its translation in the vocabulary C_f^v of C_f. The definition of M_ef directly reflects our intuition. As one can note, a general English–French bilingual dictionary D, independent from the corpus C, is required to judge whether two words are translations of each other. Let σ be a function that indicates whether a translation from the translation set T_w of a word w is found in the vocabulary C^v of a corpus C, i.e.:
σ(w, C^v) = 1 if T_w ∩ C^v ≠ ∅, and 0 otherwise.   (1)
As assumed above, the comparable corpus and the general bilingual dictionary are independent from one another. It is thus natural to assume that the dictionary D covers a substantial part of C_e^v and that this part represents the whole vocabulary well. This means the expectation of finding in C_f^v the translation of a word w in C_e^v can be approximated by that of a word in C_e^v ∩ D_e^v. This assumption leads to equation (3):

M_ef(C_e, C_f) = Σ_{w ∈ C_e^v ∩ D_e^v} σ(w, C_f^v) · Pr(w)   (3)

where D_e^v is the English vocabulary of the given bilingual dictionary D. The derivation from equation (2) to equation (3) may seem trivial, but it helps us bring up the notion of dictionary coverage and the robustness measure in Section 3.2.3.
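To make the definition concrete, the following sketch implements σ (equation (1)) and M_ef (equation (3)) with a uniform Pr(w) over the dictionary-covered vocabulary. The toy dictionary and vocabularies are hypothetical and only illustrate the mechanics:

```python
def sigma(w, target_vocab, dictionary):
    """Equation (1): 1 if some translation of w occurs in the target
    vocabulary, 0 otherwise."""
    translations = dictionary.get(w, set())
    return 1 if translations & target_vocab else 0

def m_ef(source_vocab, target_vocab, dictionary):
    """Equation (3) with uniform Pr(w): the proportion of dictionary-covered
    source words whose translation is found in the target vocabulary."""
    covered = [w for w in source_vocab if w in dictionary]
    if not covered:
        return 0.0
    return sum(sigma(w, target_vocab, dictionary) for w in covered) / len(covered)

# Toy English–French example (hypothetical data)
dico = {"house": {"maison"}, "cat": {"chat"}, "river": {"fleuve", "riviere"}}
c_e = {"house", "cat", "river", "politics"}
c_f = {"maison", "chat", "soleil"}
score = m_ef(c_e, c_f, dico)  # two of the three covered words find a translation
```

Note that "politics" is ignored entirely: it is outside the dictionary, which is exactly the approximation made when moving from equation (2) to equation (3).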
There are several possibilities to estimate Pr(w) in equation (3). However, the
presence of common words suggests that one should not solely rely on the number
of occurrences (i.e., term frequency), which is a broadly used approach in other
fields like unigram language models, since the high-frequency words will dominate
the final results. For example, in the Europarl corpus, the English word Europe and
the French word Europe are very common words. It means that even if one piece
of English text and one piece of French text are randomly picked from the Europarl
corpus, we can still expect to find many translation pairs Europe–Europe. To avoid
the bias common words can introduce in the comparability measure, one can weight
each word w as ρw through TF-IDF or through the simple Presence/Absence (P/A)
criterion. In this case, Pr(w) can be estimated as

Pr(w) = ρ_w / Σ_{w′ ∈ C_e^v ∩ D_e^v} ρ_{w′}   (4)
With the P/A criterion, the weight ρ_w is 1 if and only if w ∈ C_e^v ∩ D_e^v, and 0 otherwise. Alternatively, considering the TF-IDF weight for each word w, the weighting function ρ_w can be defined in a standard TF-IDF style (Salton, Wong and Yang 1975).
Similarly, when considering the translation process from the French part to the English part, the counterpart of M_ef, M_fe, can be written as

M_fe(C_e, C_f) = Σ_{w ∈ C_f^v ∩ D_f^v} σ(w, C_e^v) · Pr(w)   (5)
We give here additional comments on the P/A criterion. With this criterion, Pr(w) for each w in C^v ∩ D^v is directly 1/|C^v ∩ D^v|. One can then obtain the combined measure M as

M(C_e, C_f) = ( Σ_{w ∈ C_e^v ∩ D_e^v} σ(w, C_f^v) + Σ_{w ∈ C_f^v ∩ D_f^v} σ(w, C_e^v) ) / ( |C_e^v ∩ D_e^v| + |C_f^v ∩ D_f^v| )   (7)

which corresponds to the overall proportion of words for which a translation can be found in the comparable corpus. One can notice from equation (7) that M is a symmetric measure.
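The symmetric measure of equation (7) can be sketched as follows, pooling the two translation directions; the two toy dictionaries (English→French and French→English) are hypothetical:

```python
def m_symmetric(voc_e, voc_f, dico_ef, dico_fe):
    """Equation (7): proportion, over both translation directions, of
    dictionary-covered words whose translation occurs on the other side."""
    def hits_and_total(src_voc, tgt_voc, dico):
        covered = [w for w in src_voc if w in dico]
        hits = sum(1 for w in covered if dico[w] & tgt_voc)
        return hits, len(covered)

    h_ef, n_ef = hits_and_total(voc_e, voc_f, dico_ef)
    h_fe, n_fe = hits_and_total(voc_f, voc_e, dico_fe)
    total = n_ef + n_fe
    return (h_ef + h_fe) / total if total else 0.0
```

Because numerators and denominators of both directions are pooled before dividing, swapping the two sides of the corpus (together with the two dictionaries) leaves the score unchanged, which is the symmetry noted after equation (7).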
the weight of the word w_e (resp. w_f) in the vector v_e (resp. v_f). In this paper, the weight is defined in a standard TF-IDF style. The dot product between the vectors v_e and v_f is given by

⟨v_e, v_f⟩ = Σ_{w_e ∈ v_e} Σ_{w_f ∈ T_{w_e} ∩ v_f} f(w_e) f(w_f)   (8)

where T_{w_e} is the translation set of w_e in the bilingual dictionary. Different measures can be derived from the above dot product. We make use here of the standard cosine similarity, which yields the comparability measure M^v:
M^v = cos(v_e, v_f) = ⟨v_e, v_f⟩ / √( Σ_{w_e ∈ v_e} ( f(w_e)² + (Σ_{w_f ∈ T_{w_e} ∩ v_f} f(w_f))² ) )
The vector space model has shown rather satisfactory performance in classic tasks such as document classification and clustering. The content we model in this work, however, is a corpus rather than a document, and the most significant difference is that a corpus usually consists of documents on many different topics. According to the idea of topic models (Blei and Jordan 2003), each document can be seen as a mixture of separate topics. The vector representation of a whole corpus can then be seen as a rather complicated mixture of various topics, making vector comparison less accurate than in the case of documents.
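The dictionary-mapped dot product of equation (8) can be sketched as follows. For simplicity, this sketch normalizes with the standard cosine denominator over the two vectors, which is an assumption of ours; the toy TF-IDF weights are illustrative:

```python
import math

def m_v(vec_e, vec_f, dico):
    """Dictionary-mapped dot product of equation (8): each English weight is
    paired with the weights of its translations present in v_f, then the
    result is normalized as a standard cosine (a simplifying assumption)."""
    dot = sum(weight_e * vec_f[wf]
              for we, weight_e in vec_e.items()
              for wf in dico.get(we, set()) & vec_f.keys())
    norm_e = math.sqrt(sum(v * v for v in vec_e.values()))
    norm_f = math.sqrt(sum(v * v for v in vec_f.values()))
    return dot / (norm_e * norm_f) if norm_e and norm_f else 0.0
```

Note how dictionary noise enters: every spurious translation pair in `dico` adds a term to the dot product, which is one way to understand why the vector space approach degrades at the corpus level.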
8. http://translate.google.com
(1) Whether the designed comparability measures can capture different comparability levels in the corpora;
(2) Whether the proposed measures are robust to dictionary coverage.
9. http://www.statmt.org/europarl/
10. http://trec.nist.gov/
11. http://www.clef-campaign.org
12. The Wikipedia dump files can be downloaded at http://download.wikimedia.org
Fig. 1. Constructing the test corpus group Ga with gold-standard comparability levels.
the AP corpus, meaning that parts covering different topics are introduced,
leading to comparable corpora belonging to the class non-parallel corpora
covering different topics.
We now give more details on this construction process. The first group Ga is built
from the Europarl corpus through the following two steps:
(1) The English (and its corresponding French) part of the Europarl corpus is split into ten equal parts in terms of number of sentences, leading to ten parallel corpora denoted P1, P2, . . . , P10. The comparability level of these ten parallel corpora is arbitrarily set to one (i.e., the highest level);
(2) For each parallel corpus Pi (i = 1, 2, . . . , 10), we replace a certain proportion p of the English part of Pi with content of the same size, again in terms of the number of sentences, from another parallel corpus Pj (j ≠ i), producing a new corpus P′i likely to contain fewer translation pairs and thus to be of lower comparability. For each Pi, as p increases one obtains a series of comparable corpora with decreasing comparability scores. In our experiments, p is increased from 0 to 1 with a step of 0.01. All the Pi and their respective descendant corpora, according to different values of p, constitute the group Ga, the comparability score of each corpus being set to 1 − p. As a result, we have 1,000 comparable corpora in Ga. This process is illustrated in Figure 1.
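The degradation step above can be sketched as follows. The corpus representation and the choice to replace a prefix rather than an arbitrary portion are simplifications of ours:

```python
import random

def degrade(parallel_corpora, i, p, rng):
    """Replace a proportion p of the English sentences of P_i with English
    sentences from another parallel corpus P_j (j != i), yielding a degraded
    corpus whose gold-standard comparability score is set to 1 - p."""
    pi = parallel_corpora[i]
    j = rng.choice([k for k in range(len(parallel_corpora)) if k != i])
    n = int(p * len(pi["en"]))
    english = pi["en"][:]                      # copy so P_i stays intact
    english[:n] = parallel_corpora[j]["en"][:n]
    return {"en": english, "fr": pi["fr"][:]}
```

Sweeping p from 0 to 1 in steps of 0.01 over the ten corpora then reproduces the 1,000-corpus structure of group Ga; building Gb or Gc only changes where the replacement sentences come from.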
The difference between building the corpora in Gb and in Ga is that, in Gb ,
the replacement in Pi is done with documents from the AP corpus and not from
another parallel corpus Pj from Europarl. Compared with the corpora in Ga , we
further degrade the parallel corpus Pi in Gb since the AP corpus covers different
topics from Europarl.
In Gc, we start with the ten comparable corpora P′i from Ga having the lowest comparability score (i.e., 0). They thus contain documents from Europarl that are not translations of each other. Each P′i is further altered by replacing certain portions, according to the same proportion p used before, with documents from the AP corpus. Although P′i itself is comparable and not parallel, its English and French parts are likely to cover similar topics embedded in the Europarl corpus. Replacing certain parts of P′i with content from AP will thus further degrade the comparability levels of P′i.
From the process of building the comparable corpora in Ga , Gb and Gc , one can
note that the gold-standard comparability scores in different groups, e.g., Ga and
Gc , cannot be compared with each other directly, since the comparability scores
are normalized between 0 and 1 in each group of corpora. These comparability
scores do not represent absolute judgements on the comparability of the corpora
considered, but rather relative scores within each group of corpora.
Table 2. Correlation scores of the vocabulary overlapping measures with the gold standard

M_ef   M_fe   M   M^c_ef   M^c_fe   M^c

The rows TF-IDF correspond to the TF-IDF weighting schema, and the rows P/A correspond to the Presence/Absence weighting method.
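The correlation with the gold standard can be computed over the comparable corpora of a group, pairing each measure's score with the gold-standard score 1 − p. This excerpt does not name the coefficient used, so the Pearson coefficient below is an assumption of this sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between a measure's scores (xs) and the
    gold-standard comparability scores (ys) over a group of test corpora."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A score close to 1 means the measure ranks the degraded corpora almost exactly as the construction process of Section 3.2.1 does.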
Fig. 2. (Colour online) Evolution of the measures M_ef, M_fe, M, M^c_ef, M^c_fe and M^c w.r.t. the gold standard on the corpus group Gc (x-axis: gold-standard comparability scores; y-axis: comparability scores from the measures). (a) M_ef. (b) M^c_ef. (c) M_fe. (d) M^c_fe. (e) M. (f) M^c.
document and a large French document collection, it is very likely that one can find translations for most of the English words, even though the two corpora are only weakly comparable: the English vocabulary in consideration is very small, so it is easy to find all the translations even if the French corpus is not really comparable to the English content. In our case, since the average sentence length in AP is larger than that of Europarl, we considerably increase the length of the English part of the test corpora when degrading the corpora in Gb and Gc, leading to poor performance of M_fe. In order to further support our judgement, we have also tried
Table 3. Correlation scores of the baseline comparability measures with the gold standard

M^v   M^g1   M^g2   M^s
to manually control the sentence length in the AP corpus: we select from AP sentences whose length is similar to that in the original corpus. The results then show that M_fe works well under the new settings without a length bias. The length-related problem can also be overcome by M, which considers the translation in both directions.
We now turn to the contextual versions M^c_ef, M^c_fe and M^c. All three measures perform very well on the three groups of corpora. Let us pay attention to the measure M_fe and its contextual version M^c_fe. The former is sensitive to the corpus length, whereas the latter is far less sensitive. We conjecture here that contextual information helps identify the correct translations in cases where there are many possible translations (as is the case with collections unbalanced in terms of size), and thus can still capture different comparability levels. Furthermore, as the results show, M and all the context-based measures are able to capture all the differences in comparability artificially introduced in the degradation process we have considered in Section 3.2.1. Last, one can conclude from the results that it is easier to capture the different comparability levels in Gb than in Ga and Gc. This is due to the fact that the differences in Ga (based on the same corpus) are less marked, and thus more difficult to identify. Conversely, Gc comprises corpora with low levels of comparability, which are also more difficult to separate out. Gb, based on a parallel corpus with additions from a different corpus, displays comparability levels that are easier to identify.
We finally list in Table 3 the results from the baseline measures. The results shown here are obtained in the same way as the ones for the vocabulary overlapping measures in Table 2. From the results, one can find that the recently introduced bilingual skip-gram measure M^s performs best on all three groups of corpora, with the measures M^g2 and M^v performing worse. None of these baseline measures performs as well as the measure M (weighted by P/A) or the three context-based measures relying only on the vocabulary overlapping approach. The standard vector space model is broadly used to represent textual elements in previous studies aiming at capturing similarity at the sentence or document level. What our experiments reveal here is that this approach does not yield competitive performance (compared to other approaches) for measuring comparability at the corpus level. The reason is probably that the noise involved in the dictionary-based mapping of equation (8) undermines the capability of the monolingual vector space model. This finding is also partially supported by a recent report of the ACCURAT project13 that vector representation
13. Related materials can be found in the deliverables of the project. The project website is http://www.accurat-project.eu.
1 Wiki-En+Wiki-Fr Wiki-En+MON94 Y
2 Wiki-En+Wiki-Fr Wiki-En+SDA94 Y
3 Wiki-En+Wiki-Fr LAT94+Wiki-Fr Y
4 Wiki-En+Wiki-Fr GH95+Wiki-Fr Y
One could conjecture that, in each group, the first corpus should be more comparable than the second according to intuition, which constitutes the gold standard. The last column corresponds to the experiment results, where Y denotes coherence with the gold standard.
14. This is also true for all the eleven comparability levels, although we only plot four in the figure.
Fig. 3. (Colour online) Evolution of M and M^c w.r.t. different dictionary coverages on the comparable corpora P1, P1^0.7, P1^0.4 and P1^0.1 in Ga (x-axis: dictionary coverage; y-axis: comparability scores from M or M^c). (a) M. (b) M^c.
coverage is roughly above 0.62. The same conclusion can be drawn from the inspection of Figure 3(b). One can thus conclude from this qualitative analysis that both M and M^c are robust to changes in the dictionary after a certain point.
We have drawn from Figure 3 an intuitive conclusion regarding the robustness of the comparability measures. In order to analyze the results quantitatively, we first define the degree of robustness of a comparability measure.
Fig. 4. (Colour online) The frequency histogram of the comparability scores (from M) on P1^0.1 between the coverages 0.56 and 0.58 (x-axis: comparability scores; y-axis: frequency), with the associated normal approximation.
Definition 1
Let us assume that we have different comparable corpora C1, C2, . . . , Ck, with increasing gold-standard comparability levels, which can be written as C1 ≺ C2 ≺ . . . ≺ Ck (the symbol ≺ is used here to denote the relation less in the gold-standard comparability levels). We further assume we have a bilingual dictionary D such that the coverage of D on all the k corpora Ci (i = 1, 2, . . . , k) belongs to a range [r, r + ε], with ε being a small fixed value. Then, we define the degree of robustness of a comparability measure M w.r.t. the dictionary coverage r as

χ(M, r) = avg_{i ∈ {1, 2, ..., k−1}} Pr(M(C_{i+1}) > M(C_i))

Through the above definition, one measures to what extent a certain measure, associated with a certain coverage value r, can separate different comparability levels well in the average case.
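A sketch of estimating χ(M, r) follows, assuming (as the estimation below in this section does) that the scores of a measure within a small coverage range are approximately normally distributed, and that the score variables of two corpora are independent; function names are ours:

```python
import math
from statistics import mean, pstdev

def prob_greater(scores_hi, scores_lo):
    """Pr(Z_hi > Z_lo) under independent normal approximations of the two
    score samples, i.e. Pr(Z_hi - Z_lo > 0) via the normal CDF."""
    mu = mean(scores_hi) - mean(scores_lo)
    var = pstdev(scores_hi) ** 2 + pstdev(scores_lo) ** 2
    if var == 0:
        return 1.0 if mu > 0 else 0.0
    return 0.5 * (1 + math.erf(mu / math.sqrt(2 * var)))

def robustness(samples_by_level):
    """Degree of robustness: average over adjacent comparability levels of
    the probability that the higher level receives the higher score."""
    pairs = zip(samples_by_level, samples_by_level[1:])
    return mean(prob_greater(hi, lo) for lo, hi in pairs)
```

When the score distributions of adjacent levels barely overlap, each term is close to 1 and χ approaches 1; heavily overlapping distributions pull the terms toward 0.5, signalling that the measure cannot reliably separate the levels at that coverage.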
We now turn to the problem of estimating the degree of robustness of the
comparability measures M and M v . Let us first focus on the measure M. We notice
from Figure 3 that, for each of the test corpora, e.g., P10.1 in Figure 3(a), the
comparability scores corresponding to a certain coverage range (e.g., from 0.56 to
0.58, identified by the circle in the figure) follow a normal distribution, according to
the Shapiro–Wilk test (Shapiro and Wilk 1965) at the significance level 0.05. This
fact is also illustrated by the frequency histogram in Figure 4. Thus, in a coverage range of size 0.02 (i.e., the value of ε in Definition 1), the comparability scores of M for a specific corpus can be modeled as a normally distributed variable
Z. Hence, on each span, the scores of M on two bilingual corpora, say P10.1 and P10.2 ,
can be described as two normally distributed variables denoted as Z0.1 and Z0.2 of
which the parameters (i.e., the mean μ and the variance σ²) can be estimated from the samples, i.e., all the comparability scores on P10.1 and P10.2 in the specific coverage range. With the estimated parameters, we can write:

Z0.1 ∼ N(μ0.1, σ0.1²) and Z0.2 ∼ N(μ0.2, σ0.2²)

Table 5. The degree of robustness of the two measures M and M c with different dictionary coverages on all eleven comparability levels. The coverage range is set to [r, r + ε] as in Definition 1, with ε fixed to 0.02.
To compute the degree of robustness in Definition 1, one needs to obtain all the probabilities Pr(M(Ci+1) > M(Ci)). Taking here the corpora P10.2 and P10.1 as an example, one needs to estimate Pr(Z0.2 > Z0.1). Under the independence assumption between the variables Z0.1 and Z0.2, the new variable Z0.2 − Z0.1 satisfies:

Z0.2 − Z0.1 ∼ N(μ0.2 − μ0.1, σ0.2² + σ0.1²)

and computing Pr(Z0.2 > Z0.1) amounts to computing Pr(Z0.2 − Z0.1 > 0), which can be done through tabulations of the normal distribution. In Table 5, we list, for different coverage ranges, the robustness values of both comparability measures M and M c on all eleven comparability levels. Let us first consider the robustness of the measure M. One can see from the results that
the higher the dictionary coverage, the more reliably M distinguishes between the different comparability levels. Furthermore, when the dictionary coverage is above a certain threshold (e.g., 0.60), we have high confidence (≥ 0.85) that the different comparability levels between the corpus pairs are reliably captured by M; the measure is thus robust within a coverage range of the given size, here 0.02. The same conclusions can be drawn for the other comparable corpus pairs we have constructed (from Gb and Gc).
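The computation above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation; the score samples are invented, and the helper names are ours. Each corpus's scores inside one coverage bin are fitted with a normal distribution (using the standard library's NormalDist), Pr(Zi+1 > Zi) is read off the CDF of the difference variable, and χ(M, r) is the average over consecutive gold-standard levels.

```python
# Hypothetical sketch of Definition 1 (not the paper's code): fit a normal
# distribution to the comparability scores of each corpus inside one small
# coverage range [r, r + eps], then average Pr(M(C_{i+1}) > M(C_i)) over
# consecutive gold-standard comparability levels.
from math import sqrt
from statistics import NormalDist, fmean, variance

def prob_greater(scores_hi, scores_lo):
    """Pr(Z_hi > Z_lo) for independent normal fits of two score samples."""
    mu_h, var_h = fmean(scores_hi), variance(scores_hi)
    mu_l, var_l = fmean(scores_lo), variance(scores_lo)
    # Z_hi - Z_lo ~ N(mu_h - mu_l, var_h + var_l), so
    # Pr(Z_hi - Z_lo > 0) = 1 - CDF(0) of the difference distribution.
    return 1.0 - NormalDist(mu_h - mu_l, sqrt(var_h + var_l)).cdf(0.0)

def robustness(samples_by_level):
    """Degree of robustness: samples ordered as C1 < C2 < ... < Ck."""
    return fmean(prob_greater(hi, lo)
                 for hi, lo in zip(samples_by_level[1:], samples_by_level))

# Invented scores in one coverage bin for three comparability levels
levels = [[0.40, 0.42, 0.41, 0.43],
          [0.50, 0.52, 0.51, 0.53],
          [0.60, 0.62, 0.61, 0.63]]
print(robustness(levels))  # close to 1.0: the levels are well separated
```

When the per-level score distributions overlap heavily, the pairwise probabilities approach 0.5 and the robustness degrades accordingly, which matches the behaviour reported for low dictionary coverages.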
One can further conclude from Table 5 that, compared to M, it is easier to achieve high confidence, and thus higher robustness, with M c. Even when the dictionary coverage is only 0.50, the confidence that M c captures the different comparability levels is larger than 0.86. However, the computational cost of the context-based measures is usually high. For this reason, we will only make use, in the remainder of this study, of the measure M weighted by the P/A criterion.
The comparability scores (by M) of C0, C1 and C2 are 0.882, 0.912 and 0.916, respectively.
obtained, which amounts in this case to the proportion of lists containing the correct
translation (in case of multiple translations, a list is deemed to contain the correct
translation as soon as one of the possible translations is present). This evaluation procedure has been used in previous work (e.g., Gaussier et al. (2004)) and is now standard for the evaluation of lexicons extracted from comparable corpora. In this study, we consider bilingual lexicon extraction with potential usage in CLIR and set N to 20, following previous studies such as Gaussier et al. (2004). Furthermore, several
studies have shown that it is easier to find the correct translations for frequent
words than for infrequent ones (Pekar et al. 2006). To take this fact into account,
we distinguish several frequency ranges and assess the impact of the corpus quality measured by M in each of them. Words with frequency less than 100 are defined as low-frequency words (WL), words with frequency larger than 400 are high-frequency words (WH), and words with frequency in between are medium-frequency words (WM).
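The evaluation protocol just described can be made concrete with a short sketch. The data and helper names below are invented for illustration and do not come from the paper: a source word counts as a hit if any of its gold translations appears among its top-N ranked candidates, and words are bucketed by the frequency thresholds defined above.

```python
# Hypothetical sketch (invented data, not the paper's implementation) of the
# top-N precision used to evaluate lexicons extracted from comparable corpora.
def precision_at_n(candidates, gold, n=20):
    """Proportion of source words whose top-n list contains a gold translation."""
    hits = sum(1 for word, ranked in candidates.items()
               if set(ranked[:n]) & gold.get(word, set()))
    return hits / len(candidates) if candidates else 0.0

def frequency_bucket(freq):
    """Frequency ranges defined above: WL (< 100), WM (100-400), WH (> 400)."""
    if freq < 100:
        return "WL"
    if freq > 400:
        return "WH"
    return "WM"

# Toy ranked candidate lists for two French source words
candidates = {"maison": ["house", "home", "building"],
              "chat": ["dog", "cat"]}
gold = {"maison": {"house"}, "chat": {"cat"}}
print(precision_at_n(candidates, gold, n=1))  # 0.5: only "maison" hits at n=1
```

With multiple gold translations per word, the set intersection implements the "a list is deemed to contain the correct translation as soon as one of the possible translations is present" rule.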
The results obtained are displayed in Table 6. They show that the standard approach performs significantly better on the improved corpora C1/C2 than on the original corpus C0. The overall precision is increased by 5.3% on C1 (a relative increase of 26%) and by 9.5% on C2 (a relative increase of 51%). It should also be noted that the performance of the standard approach is better on C2 than on C1. When considering the different frequency ranges separately, one comes to the same conclusion. In short, these results show that the performance of lexicon extraction is consistent with the comparability scores computed for the three corpora, which further confirms that the comparability measure M accurately quantifies corpus quality.
4 Discussion
The quality of a comparable corpus is an important factor affecting its usability in NLP tasks. The work presented in this paper is a first attempt to systematically investigate a set of approaches to measuring corpus quality. We would like to give additional comments on some parts of the work described in the previous sections. In order to evaluate the comparability measures, we have designed in Section 3.2 a novel evaluation scheme in which corpora with gold-standard comparability levels are produced from a parallel corpus and an additional monolingual corpus. We believe this design to be the most direct one that one can think of in order to obtain
5 Conclusion
In this paper, we have first reviewed the notion of comparability in light of the
usage of various bilingual corpora. This notion motivates the introduction of several
comparability measures, as well as experiments designed to validate them. We find
from those experiments that the measure M, together with the P/A weighting scheme,
correlates well with gold-standard comparability levels and is robust to the dictionary
coverage. Moreover, this measure has a low computational cost. The measure M is then validated on the task of bilingual lexicon extraction, where it correlates with the performance of a real application. We have made two main contributions in this paper: (1) A systematic approach to test whether a comparability measure is reliable. We could not find in previous work strategies that provide a quantitative evaluation of a comparability measure, since it is difficult to construct comparable corpora of
known comparability levels. (2) A comparability measure that correlates well with
gold-standard comparability levels and is robust to dictionary coverage.
References
Abdul-Rauf, S., and Schwenk, H. 2009. On the use of comparable corpora to improve
SMT performance. In Proceedings of the 12th Conference of the European Chapter of the
Association for Computational Linguistics, pp. 16–23.
Bahdanau, D., Cho, K., and Bengio, Y. 2015. Neural machine translation by jointly learning
to align and translate. In Proceedings of the 3rd International Conference on Learning
Representations, San Diego, CA, pp. 1–15.
Ballesteros, L., and Croft, W. B. 1997. Phrasal translation and query expansion techniques for
cross-language information retrieval. In Proceedings of the 20th ACM SIGIR, Philadelphia,
Pennsylvania, USA, pp. 84–91.
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993–1022.
Boyd-Graber, J., and Blei, D. M. 2009. Multilingual topic models for unaligned text. In
Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI-2009), pp. 75–82.
Chebel, M., Latiri, C., and Gaussier, E. 2017. Bilingual lexicon extraction from comparable
corpora based on closed concepts mining. In Proceedings of the 21st Pacific-Asia Conference
on Knowledge Discovery and Data Mining, Jeju, Korea, pp. 586–598.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. 1990. Indexing
by latent semantic analysis. Journal of the American Society for Information Science 41(6):
391–407.
Deshmukh, A., and Hegde, G. 2012. A literature survey on latent semantic indexing.
International Journal of Engineering Inventions 1(4): 1–5.
Fung, P., and Yee, L. Y. 1998. An IR approach for translating new words from nonparallel,
comparable texts. In Proceedings of the 17th International Conference on Computational
Linguistics, Montreal, Quebec, Canada, pp. 414–20.
Gaussier, E., Renders, J. M., Matveeva, I., Goutte, C., and Déjean, H. 2004. A geometric
view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd
Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp.
526–33.
Hazem, A., and Morin, E. 2016. Efficient data selection for bilingual terminology extraction from comparable corpora. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 3401–11.
Hermann, K. M., and Blunsom, P. 2014. Multilingual models for compositional distributional semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Maryland, USA, pp. 58–68.
Hewavitharana, S., and Vogel, S. 2008. Enhancing a statistical machine translation system by
using an automatically extracted parallel corpus from comparable sources. In Proceedings
of the LREC 2008 Workshop on Comparable Corpora.
Ji, H. 2009. Mining name translations from comparable corpora by creating bilingual
information networks. In Proceedings of the 2nd Workshop on Building and Using
Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC-2009), pp. 34–7.
Kilgarriff, A. 2001. Comparing corpora. International Journal of Corpus Linguistics 6: 97–133.
Koehn, P. 2005. Europarl: a parallel corpus for statistical machine translation. In Proceedings
of MT Summit 2005.
Li, B., and Gaussier, E. 2010. Improving corpus comparability for bilingual lexicon
extraction from comparable corpora. In Proceedings of the 23rd International Conference
on Computational Linguistics, Beijing, China, pp. 644–52.
Li, B., Gaussier, E., and Aizawa, A. 2011. Clustering comparable corpora for bilingual
lexicon extraction. In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies, Portland, Oregon, USA,
pp. 473–8.
Luong, T., Pham, H., and Manning, C. D. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the NAACL Workshop on Vector Space Modeling for NLP.
Markantonatou, S., Sofianopoulos, S., Spilioti, V., Tambouratzis, G., Vassiliou, M., and
Yannoutsou, O. 2006. Using patterns for machine translation. In Proceedings of the
European Association for Machine Translation, pp. 239–46.
Mathieu, B., Besancon, R., and Fluhr, C. 2004. Multilingual document clusters discovery. In
Proceedings of RIAO, pp. 116–25.
Morin, E., Daille, B., Takeuchi, K., and Kageura, K. 2007. Bilingual terminology mining -
using brain, not brawn comparable corpora. In Proceedings of the 45th Annual Meeting of
the Association for Computational Linguistics, Prague, Czech Republic, pp. 664–71.
Munteanu, D. S., Fraser, A., and Marcu, D. 2004. Improved machine translation performance
via parallel sentence extraction from comparable corpora. In Proceedings of the HLT-
NAACL 2004, Boston, MA, USA, pp. 265–72.
Munteanu, D. S., and Marcu, D. 2006. Extracting parallel sub-sentential fragments from
non-parallel corpora. In Proceedings of the 21st International Conference on Computational
Linguistics and the 44th annual meeting of the Association for Computational Linguistics,
Sydney, Australia, pp. 81–8.
Ni, X., Sun, J. T., Hu, J., and Chen, Z. 2009. Mining multilingual topics from Wikipedia.
In Proceedings of the 18th International Conference on World Wide Web. WWW ’09, pp.
1155–6.
Och, F. J., and Ney, H. 2003. A systematic comparison of various statistical alignment models.
Computational Linguistics 29(1): 19–51.
Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. 2002. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association
for Computational Linguistics, pp. 311–8.
Pekar, V., Mitkov, R., Blagoev, D., and Mulloni, A. 2006. Finding translations for low-
frequency words in comparable corpora. Machine Translation 20(4): 247–66.
Rapp, R. 1999. Automatic identification of word translations from unrelated English
and German corpora. In Proceedings of the 37th Annual Meeting of the Association for
Computational Linguistics, College Park, Maryland, USA, pp. 519–26.
Rayson, P., and Garside, R. 2000. Comparing corpora using frequency profiling. In Proceedings
of the ACL Workshop on Comparing Corpora, pp. 1–6.
Robitaille, X., Sasaki, Y., Tonoike, M., Sato, S., and Utsuro, T. 2006. Compiling
French-Japanese terminologies from the web. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy,
pp. 225–32.
Salton, G., Wong, A., and Yang, C. S. 1975. A vector space model for automatic indexing.
Communications of the ACM 18: 613–20.
Saralegi, X., SanVicente, I., and Gurrutxaga, A. 2008. Automatic extraction of bilingual
terms from comparable corpora in a popular science domain. In Proceedings of the
6th International Conference on Language Resources and Evaluations - Building and Using
Comparable Corpora Workshop.
Schmid, H. 1995. Improvements in part-of-speech tagging with an application to German. In
Proceedings of the ACL SIGDAT-Workshop, pp. 47–50.
Shapiro, S. S., and Wilk, M. B. 1965. An analysis of variance test for normality (complete
samples). Biometrika 52(3): 591–611.
Sharoff, S. 2007. Classifying web corpora into domain and genre using automatic feature
identification. In Proceedings of Web as Corpus Workshop, Louvain-la-Neuve.
Sharoff, S., Rapp, R., and Zweigenbaum, P. 2013. Overviewing important aspects of the last twenty years of research in comparable corpora. In S. Sharoff, R. Rapp, P. Zweigenbaum, and P. Fung (eds.), Building and Using Comparable Corpora. Berlin: Springer-Verlag, pp. 1–17.
Skadina, I., Vasiljevs, A., Skadins, R., Gaizauskas, R., Tufis, D., and Gornostay, T. 2010.
Analysis and evaluation of comparable corpora for under-resourced areas of machine
translation. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora
(LREC-2010), pp. 6–14.
Talvensaari, T., Laurikkala, J., Järvelin, K., Juhola, M., and Keskustalo, H. 2007. Creating and
exploiting a comparable corpus in cross-language information retrieval. ACM Transactions
on Information Systems 25(1): 4.
Upadhyay, S., Faruqui, M., Dyer, C., and Roth, D. 2016. Cross-lingual models of word
embeddings: an empirical comparison. In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics, Berlin, Germany, pp. 1661–1670.
Vulic, I., and Moens, M. F. 2015. Bilingual word embeddings from non-parallel document-
aligned data applied to bilingual lexicon induction. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics, pp. 719–725.
Washtell, J. 2009. Co-dispersion: a windowless approach to lexical association. In Proceedings
of the 12th Conference of the European Chapter of the Association for Computational
Linguistics, pp. 861–9.