Machine Translation: From Real Users to Research
Springer
eBook ISBN: 3-540-30194-1
Print ISBN: 3-540-23300-8
The previous conference in this series (AMTA 2002) took up the theme “From Research
to Real Users”, and sought to explore why recent research on data-driven machine
translation didn’t seem to be moving to the marketplace. As it turned out, the first
commercial products of the data-driven research movement were just over the horizon,
and in the intervening two years they have begun to appear in the marketplace. At the same
time, rule-based machine translation systems are introducing data-driven techniques into
the mix in their products.
Machine translation as a software application has a 50-year history. There are an
increasing number of exciting deployments of MT, many of which will be exhibited and
discussed at the conference. But the scale of commercial use has never approached the
estimates of the latent demand. In light of this, we reversed the question from AMTA
2002, to look at the next step in the path to commercial success for MT. We took user
needs as our theme, and explored how or whether market requirements are feeding into
research programs. The transition of research discoveries to practical use involves technical questions that are not as sexy as those that have driven the research community and
research funding. Important product issues such as system customizability, computing
resource requirements, and usability and fitness for particular tasks need to engage the
creative energies of all parts of our community, especially research, as we move machine
translation from a niche application to a more pervasive language conversion process.
These topics were addressed at the conference through the papers contained in these proceedings, and even more specifically through several invited presentations and panels.
The commercial translation community weighed in through the invited presentations
of Ken Rother, CIO of Bowne Global Solutions, and Jaap van der Meer, a founding
partner at Cross Language. Bowne Global Solutions is the largest of the world’s “Big 3”
translation services companies. Cross Language is one of a handful of new consulting
and services companies formed to help customers select, customize and deploy machine
translation. The US Government was represented by Kathy Debolt, Chief of the Army’s
Language Technology Office. Panel discussions included a forward-looking dialog between current students of translation and computational linguistics. Human translators
as well as current users of machine translation also discussed working with machine
translation.
2004 marked 10 years since the first AMTA conference, held in Columbia, Maryland
in October 1994. With our sixth biennial conference, we returned to the Washington area.
The timing and location of AMTA 2004 were very special to the history of machine
translation. The conference was held at Georgetown University, the site of seminal
operational experiments in machine translation, beginning in 1954. To mark the 50th
anniversary of the realization of automated translation, we included a panel of five of the
pioneers from Georgetown (Tony Brown, Christine Montgomery, Peter Toma, Muriel
Vasconcellos, and Michael Zarechnak), as well as an overview of the beginnings of MT
by John Hutchins.
How are people using MT today? Lots of innovative applications are emerging.
A new event at this conference was a half-day Research and Deployment Showcase.
The showcase gave attendees an opportunity to see operational systems that incorporate
machine translation together with other hardware and software tools to accomplish real-world tasks. The research portion of the showcase offered a preview of the next generation
of machine translation in operational prototype systems developed by research groups.
One of the founding goals of AMTA was to bring together users, researchers and
developers in an ongoing dialog that gives members of each of these communities a
chance to hear and respond to each other's concerns and interests. AMTA 2004 made
a special effort to bring users and researchers together with the goal of inspiring and
motivating each.
Acknowledgements may seem like obligatory filler to readers removed by space and
time from the original event, but no organizer can reflect on the process without a deep
sense of gratitude to the team that did most of the real work. AMTA 2004 was informed
and realized by the inspirations and volunteer efforts of:
Additional information about the AMTA, its mission, activities and publications,
can be obtained from the association website: www.amtaweb.org, and from AMTA
Focalpoint Priscilla Rasmussen, 3 Landmark Center, East Stroudsburg, PA 18301, USA;
phone: +1-570-476-8006; fax: +1-570-476-0860; focalpoint@amtaweb.org
Jeffrey Allen
Mycom France
91, avenue de la République
75011 Paris FRANCE
jeffrey.allen@mycom-int.com
1 Background
The result of this project is a limited set of 15 Nortel Standard English (NSE)
rules.
Mycom International: Mycom considered these previous implementations in the telecom field and decided to take a completely different approach to implementing MT technologies to meet production needs for translating various documents. This has been more of an example-based, osmosis learning approach. The Mycom Technical Documentation department was set up in June 2001. The department manager trained the small team of technical writers to write in a clear, concise format that would benefit both the monolingual workflow and a potential multilingual workflow. No mention of CLs was made; rather, a simple and basic approach to good technical writing, which included CL principles, was taken. All product documentation (user, installation, responses for information, pre-sales brochures, technical overviews, etc.) was produced through this department, so a standardized approach was used for creating the product-related documentation. Significant text-leverage and recycling processes have been used for the creation of all documentation.
This paper describes the use of MT technologies for translating both pre-sales
marketing and post-sales software deployment documentation in the software product
division of Mycom International.
The software used for this implementation was Reverso Expert (v5), which is based
on the PROMT v5 MT kernel. The implementer on the team at Mycom has significant
experience in using this MT product. In order to be most productive, the team decided
to use an existing method of MT implementation (created by Jeff Allen) as described
elsewhere (Allen, 2001; Allen, 2003; Guerra, 2003).
3 Context
4 Quality
Some translation agencies have argued in the past that the quality of MT post-editing output is questionable when the post-editing is not conducted by translation professionals.
To clarify the issue of quality in this context, the person involved in this
implementation has several years of experience as a (human) translator and trainer of
professional translators. The person is fully bilingual in both the source and target
languages and provides training courses for industry in both languages on a range of
topics. This person has been with the Mycom International product division since
prototype days and has been involved with all phases of product management with
key customers over the past few years. This experience has included designing,
authoring, and editing an entire range of the company’s user guides and manuals,
marketing brochures, R&D specifications, a range of test documents, and all other
types of customer-facing documents. As a subject matter expert in this company's area
of activity, this person is fully competent to appropriately evaluate and determine, in
both languages, the level of textual quality that is necessary for complete end-to-end
care of Mycom customers.
A study conducted this past year on human translation speed (Allen, 2004), based on answers received from dozens of sources of many types and sizes (translation agencies, government translation bureaus, translation departments of international organizations, freelance translators, etc.), shows that the average translator can translate approximately 2,400 words per day. These published statistics are used as a comparison baseline for the present project.
6 Results
6.1 Document 1
6.1.2 Measurements
The MT implementer logged the following statistics to complete the entire translation
job:
15 min: time spent with another subject matter expert to identify potentially ambiguous terms and decide on the desired translated form for each of them.
6.1.3 Sub-conclusion
3 hours to complete 1,500 words.
6.2.2 Measurements
Every detail of the implementation was carefully timed, logged and documented.
0.5 hours: strip out all text in the original document (the actual test case scenarios) that was already in English, in order to keep only the descriptive text in French.
1 hour 45 min: disambiguation and finishing the coding of the dictionary for this section of the test plan.
6.2.3 Sub-conclusion
6.2.3.3 Dictionary
Over 400 dictionary entries were created during the production of the translated
document.
7 Conclusion
An average human translator would normally produce 2,400 words per day and would thus take 3-4 days to translate this product translation job from scratch. This implementation shows that a fully bilingual subject matter expert, with professional translation experience, combined with excellent mastery of a commercial MT system, can complete the entire workflow of the translation job in slightly more than 1 day of work. This clearly shows, for this specific language direction, and on a given commercial MT tool, that such an individual, working on production documents, can complete the job in 25%-30% of the time necessary and expected via traditional translation methods. The results of the translations have been clearly recorded in e-mail messages of congratulations on the success of both translated documents. These e-mails clearly indicate that these types of translation implementation efforts could enhance the company's ability to write pre-sales marketing and post-sales technical documents in one language, and later translate, adapt, and localize them into another language for other customers.
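As a rough back-of-the-envelope check (added here for illustration; this calculation is not reported in the paper), the figures quoted above are mutually consistent:

\[
\frac{t_{\mathrm{MT\text{-}assisted}}}{t_{\mathrm{from\ scratch}}} \approx \frac{1\ \mathrm{day}}{3\text{--}4\ \mathrm{days}} \approx 0.25\text{--}0.33,
\]

which falls in the stated 25%-30% band; at the 2,400 words/day baseline, a 3-4 day job corresponds to a volume of roughly 7,200-9,600 source words.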
Both translation jobs were conducted by creating a specific custom dictionary for
the technical terminology of the documentation of this company. These dictionaries
are usable in the Reverso (v5) family of MT software products. The dictionaries can
also be exported in hard print and softcopy formats.
References
1. Allen, Jeffrey. March 2004. Translation speed versus content management. In special
supplement of Multilingual Computing and Technology magazine, Number 62, March 2004.
http://www.multilingual.com/machineTranslation62.htm
2. Allen, Jeffrey. 2001. Postediting: an integrated part of a translation software program. In
Language International magazine, April 2001, Vol. 13, No. 2, pp. 26-29.
http://www.geocities.com/mtpostediting/Allen-LI-article-Reverso.pdf
3. Allen, Jeffrey. 2003. Post-editing. In Computers and Translation: A Translator's Guide. Edited by Harold Somers. Benjamins Translation Library, 35. Amsterdam: John Benjamins. (ISBN 90 272 1640 1). http://www.benjamins.com/cgi-bin/t_bookview.cgi?bookid=BTL_35
4. Guerra, Lorena. 2003. Human Translation versus Machine Translation and Full Post-Editing
of Raw Machine Translation Output. Master’s Thesis. Dublin City University.
http://www.promt.ru/news-e/news.phtml?id=293
5. O'Brien, Sharon. 2003. An Analysis of Several Controlled English Rule Sets. Presented at EAMT/CLAW 2003, Dublin, Ireland, 15-17 May 2003. http://www.ctts.dcu.ie/presentations.html
A Speech-to-Speech Translation System for
Catalan, Spanish, and English
1 Introduction
In this paper we describe an interlingual speech-to-speech translation system
for Spanish, Catalan and English which is under development as part of the
European Union-funded FAME project (FAME stands for Facilitating Agent for Multicultural Exchange and focuses on the development of multimodal technologies to support multilingual interactions; see http://isl.ira.uka.de/fame/). The system is an extension of the
existing NESPOLE! translation system (cf. [4],[5]) to Spanish and Catalan in
the domain of hotel reservations. At its core is a robust, scalable, interlingual
speech-to-speech translation system having cross-domain portability which allows for effective translingual communication in a multimodal setting. Initially,
the system architecture was based on the NESPOLE! platform. Now the general
architecture integrating all modules is based on an Open Agent Architecture
(OAA; http://www.ai.sri.com/~oaa) [2]. This type of multi-agent framework offers a number of technical
features that are highly advantageous for the system developer and user.
Our system consists of an analyzer that maps spoken language transcriptions
into an interlingua representation and a generator that maps from the interlingua into target-language text.
3 Language Parsing
Parsing spontaneous speech poses unique challenges, including disfluencies, speech
fragments, and ungrammatical sentences, which are not present in the parsing
of written text. Users of the FAME system should feel free to speak naturally,
without restricting these characteristics of spontaneous speech. Thus, the parsing module must be able to produce a reasonable analysis of all types of spoken
input within the given domain.
The SOUP parser [1] was developed specifically to handle the unique characteristics of spontaneous speech. Typically one would not expect a single parse tree to cover an entire utterance of spontaneous speech, because such utterances frequently contain multiple segments of meaning. In an interlingua system, these segments are called Semantic Dialogue Units (SDUs), each of which corresponds to a single domain action (DA). Thus, one of the SOUP features most relevant to the
FAME system is the capability to produce a sequence of parse trees for each
utterance, effectively segmenting the input into SDUs (corresponding to domain
actions) at parse time.
SOUP is a stochastic, chart-based, top-down parser which efficiently uses very
large, context-free semantic grammars. At run-time the grammars are encoded
as recursive transition networks (RTNs) and a lexicon is produced as a hash table
of grammar terminals. A bottom-up filtering technique and a beam search both
enhance parsing speed during the population of the chart. The beam incorporates
a scoring function that seeks to maximize the number of words parsed while
minimizing parse lattice complexity (approximated by the number of nodes in
the potential parse tree). After the chart is populated, a second beam search
seeks to minimize the number of parse trees per utterance. Only non-overlapping
parse lattices are considered.
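As a concrete illustration of this beam scoring, the following minimal sketch (in Python) assumes a simple linear trade-off between word coverage and lattice size; the penalty weight, data and function name are illustrative and do not reproduce SOUP's actual scoring function.

# Prefer candidate analyses that cover more input words while keeping
# the parse lattice (approximated by its node count) small.
def parse_score(words_covered, lattice_nodes, node_penalty=0.1):
    """Higher is better: reward coverage, penalise lattice complexity."""
    return words_covered - node_penalty * lattice_nodes

candidates = [
    {"words_covered": 7, "lattice_nodes": 40},  # broad but complex analysis
    {"words_covered": 6, "lattice_nodes": 12},  # slightly less coverage, far simpler
]
best = max(candidates, key=lambda c: parse_score(c["words_covered"], c["lattice_nodes"]))
print(best)  # the simpler analysis wins under this weighting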
In addition, SOUP allows partial parses, i.e., some of the input may remain unparsed without the entire utterance failing. The Interchange Format (IF) gives
a special representation to these fragments of meaning, with the goal that enough
information may be preserved to complete the conversation task, and minimize
frustration for the user. This special domain action signals to the generation
module that only a fragment has been analyzed, and the generation output can
be adjusted accordingly.
An analysis mapper completes the analysis chain by performing formatting functions on the parser output to produce standard Interchange Format according to the IF specification. These functions include the conversion of numbers for quantities and prices, as well as of lists of numerals like those found in credit card numbers.
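A toy sketch (in Python) of the kind of numeral formatting the mapper performs is shown below; the word table and function name are invented for illustration and are not the mapper's actual code.

# Join spoken digit words into a digit string, as in credit-card numbers.
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def digits_to_string(words):
    return "".join(DIGITS[w] for w in words if w in DIGITS)

print(digits_to_string("four five one two".split()))  # prints 4512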
In the IF representation, action=e-call-2 stands for the action call and e-time=following represents the future time conveyed in English by the auxiliary will and in Catalan and Spanish by the inflected ending -aré. The large amount of morphological information is handled by the use of an @ rule. These rules are used when a single bottom-level token represents more than one IF value. However, in this case this is not the only problem to solve, as e-call-2 and e-time=following are not sister nodes; they appear at different levels. As a consequence, besides an @ rule we will also use a nested_xxx rule to move the e-time= argument up one level. The rule to be used is [e-call-2@nested_e-time=following] for llamaré.
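The following toy fragment (not the grammar formalism itself; the table and function are invented for illustration) shows the effect such a rule has: a single bottom-level token contributes both an action concept and a nested argument to the IF representation.

# One surface token expands to an action plus a nested time argument.
AT_RULES = {
    "llamaré": {"action": "e-call-2", "nested": {"e-time": "following"}},
}

def if_contribution(token):
    """Return the IF values contributed by one bottom-level token."""
    return AT_RULES.get(token, {})

print(if_contribution("llamaré"))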
Another difference is that English modifiers precede the noun whereas Spanish and Catalan modifiers usually follow it. As said before, the grammars were developed using a corpus of real dialogue data, which was useful for deciding which modifiers must be postposed (e.g., specifiers, colors, sizes). For example, the adjective big, referring to the size of the room, precedes the English noun room, while in the Spanish and Catalan sentences the size adjective follows the noun it modifies.
Word order is also a point of difference between English and the Romance languages. English word order is much more fixed, while Spanish and Catalan are relatively free word-order languages. For instance, English adjuncts have a fixed position either at the beginning or at the end of the sentence, whereas Spanish and Catalan adjuncts are much more flexible and can appear almost anywhere. Example:
4 Language Generation
The generation module of the translation part of our interlingua-based MT system includes a generation mapper and the GenKit generator.
The generation mapper was originally developed for the NESPOLE! system.
It converts a given interchange format representation into a feature structure.
This feature structure then serves as input to the GenKit generator, which was
originally developed at Carnegie Mellon’s Center for Machine Translation[6] and
has been updated and re-implemented since.
GenKit is a pseudo-unification-based generation system. It operates on a grammar formalism based on Lexical Functional Grammar and consists of context-free phrase structure description rules augmented with clusters of pseudo-unification equations for feature structures. A top-down control strategy is used to create a generation tree from an input feature structure. The leaves of the generation tree are then read off as a pre-surface-form generation. Subsequently, this pre-surface form is passed through a post-processing component, which generates the actual word strings of the generated output in the correct format.
The generator uses hybrid syntactic/semantic grammars for generating a sentence from an interchange-format feature structure. Generation knowledge employed in generation with GenKit consists of grammatical, lexical, and morphological knowledge. The higher-level grammar rules reflect the phrase structure requirements of the particular language. A lexical look-up program uses lexical entries for associating (in this case, Catalan or Spanish) words with semantic interchange-format concepts and values. These lexical entries contain not only the root forms of these words, but are also enriched with, for example, lexical information pertinent to morphology generation (such as gender information in the case of nouns) and, in the case of verbs, subcategorization requirements.
Generation of the correct morphological form is performed via inflectional
grammar rules that draw on additional information stored in the lexical entries
(see above). In the more complex case of verb morphology, the correct form is
then retrieved from an additional morphological form look-up table. Such a table
turned out not to be necessary for the other grammatical categories in the case
of Catalan and Spanish. The actual morphological forms are then produced in
the post-processing stage.
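The sketch below illustrates how such a morphological look-up can work; the table and function names are invented, and the Spanish entries are ordinary conjugations of llamar used only as examples.

# The lexical entry supplies the root; the feature structure supplies
# person/number and tense; together they index the form table.
VERB_FORMS = {
    ("llamar", "1sg", "future"): "llamaré",
    ("llamar", "3sg", "present"): "llama",
}

def inflect(root, person_number, tense):
    """Return the inflected form, falling back to the root if unknown."""
    return VERB_FORMS.get((root, person_number, tense), root)

print(inflect("llamar", "1sg", "future"))  # llamaré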
Generation grammar rules may be either general, covering a wide range of speech act and concept combinations, or more specifically geared towards a given domain action. A smaller number of very general rules were written to cover a broad spectrum of possible domain actions, sacrificing style in some cases. One such example is the following rule for give- or request-information+action:
In other words, the desired breadth of coverage this approach achieves sometimes reduces the fluency and naturalness of the generated output to a degree.
On the other hand, more specific rules were written for frequently occurring
interchange format tags to ensure that generation of these is highly fluent and
natural-sounding, and thus stylistically easy to read or listen to. This is the case
of the following full domain-action rule for offer+confirmation:
5 Text-to-Speech Synthesis
The Text-to-Speech (TTS) block in our system converts translated text into
speech. For both Spanish and Catalan, we use the TTS system fully developed
at the UPC. This system is a unit-selection-based concatenative speech synthesis
system, an approach that gives high levels of intelligibility and naturalness.
The TTS process can be divided into two blocks: (1) a linguistic block, and
(2) a synthesis block. In the second block there is a concatenation of adjacent
units and some modification (pitch-synchronous signal processing in the time domain) if target prosody and unit attributes do not match (the UPC system
acceptable), where the second one was further divided into Ok+, Ok and Ok-.
Evaluation was performed separately on the grounds of form and content. In order to evaluate form, only the generated output was given to the three different evaluators (two of the evaluators were familiar with the Interchange Format, but the third one was not). The evaluation of content took into account both the Spanish input and the English output. Accordingly, the meaning of the evaluation metrics varies depending on whether they are being used to judge form or content:
Perfect: well-formed output (form) or full communication of the speakers' information (content).
Ok+/Ok/Ok-: acceptable output, grading from only some minor form error (e.g., a missing determiner) or some minor non-communicated information (Ok+) to some more serious form or content problems (Ok-).
Unacceptable: unacceptable output, either essentially unintelligible or simply totally unrelated to the input.
On the basis of the four test dialogues used, the results presented in Table 1 were obtained. Regarding form, 65.95% of the translations were judged to be well-formed, 22.87% were acceptable (most of them with minor mistakes), while only 11.17% were essentially unintelligible. In regard to content, 70.21% of the translations communicated all the speakers' information, 9.57% communicated part of the information the speaker intended to communicate (always allowing the communication to continue, even if some information was missing), and 20.21% failed to communicate any of the information the speaker intended to communicate (providing either unrelated information or no information at all; the latter seems to be due to the wider coverage of the test data, both in regard to syntax and semantics).
With the initial stage of system development nearing successful completion, our
efforts will turn to system evaluation, to confronting the more serious technical
problems which have arisen thus far, and to extending the system both within the reservations domain and to further travel-related domains. In addition to evaluating the quality of the throughput of the MT component using BLEU and NIST
References
1. Gavaldà, M.: SOUP: A Parser for Real-world Spontaneous Speech. In Proceedings
of the 6th International Workshop on Parsing Technologies (IWPT-2000), Trento,
Italy (2000)
2. Holzapfel, H., Rogina, I., Wölfel, M., Kluge, T.: FAME Deliverable D3.1: Testbed
Software, Middleware and Communication Architecture (2003)
3. Levin, L., Gates, D., Wallace, D., Peterson, K., Lavie, A., Pianesi, F., Pianta, E.,
Cattoni, R., Mana, N.: Balancing Expressiveness and Simplicity in an Interlingua
for Task based Dialogue. In Proceedings of ACL-2002 workshop on Speech-to-speech
Translation: Algorithms and Systems, Philadelphia, PA, U.S.(2002)
4. Metze, F., McDonough, J., Soltau, J., Langley, C., Lavie, A., Levin, L., Schultz,
T., Waibel, A., Cattoni, L., Lazzari, G., Mana, N., Pianesi, F., Pianta, E.: The
NESPOLE! Speech-to-Speech Translation System. In Proceedings of HLT-2002, San
Diego, California, U.S., (2002)
5. Taddei, L., Besacier, L., Cattoni, R., Costantini, E., Lavie, A., Mana, N., Pianta,
E.: NESPOLE! Deliverable D17: Second Showcase Documentation (2003) In NE-
SPOLE! Project web site: http://nespole.itc.it.
6. Tomita, M., Nyberg, E.H.: Generation Kit and Transformation Kit, Version 3.2,
User’s Manual. Technical Report CMU-CMT-88-MEMO. Pittsburgh, PA: Carnegie
Mellon, Center for Machine Translation (1988)
7. Woszczyna, M., Coccaro, N., Eisele, A., Lavie, A., McNair, A., Polzin, T., Rogina, I.,
Rose, C., Sloboda, T., Tomita, M., Tsutsumi, J., Aoki-Waibel, N., Waibel, A., Ward,
W.: Recent Advances in JANUS : A Speech Translation System. In Proceedings of
Eurospeech-93 (1993) 1295–1298
Multi-Align: Combining Linguistic and
Statistical Techniques to Improve Alignments
for Adaptable MT
1 Introduction
The continuously growing MT market faces the challenge of translating new
languages, diverse genres, and different domains using a variety of available lin-
guistic resources. As such, MT system adaptability has become a sought-after
necessity. An adaptable statistical or hybrid MT system relies heavily on the
quality of word-level alignments of real-world data.
This paper introduces Multi-Align, a new framework for incremental testing of different alignment algorithms and their combinations. The success of statistical alignment has been demonstrated to a certain extent, but such approaches rely on large amounts of training data to achieve high-quality word alignments. Moreover, statistical systems are often incapable of capturing structural differences between languages (translation divergences), non-consecutive phrasal information, and long-range dependencies. Researchers have addressed these deficiencies by incorporating lexical features into maximum entropy alignment models [17]; however, the range of these lexical features has been limited to simple linguistic phenomena and no results have been reported.
2 Multi-Align
Multi-Align is a general alignment framework where the outputs of different aligners are combined to obtain an improvement over the performance of any single aligner. This framework provides a mechanism for combining linguistically-informed alignment approaches with statistical aligners.
Figure 1 illustrates the Multi-Align design. In this framework, different alignment systems generate word alignments between a given English sentence and a foreign language (FL) sentence. Then, an Alignment Combiner uses this
If the alignments are forced to be one-to-one, then only the entry with the highest probability in each row and column is deemed correct. If all aligners are treated equally, setting equal confidence weights for all aligners is sufficient.
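A hedged sketch of this combination step is given below: each aligner votes for word-pair links, votes are weighted by a per-aligner confidence, and links whose accumulated weight reaches a threshold are kept (a one-to-one constraint could then prune to the best link per row and column). The weights and threshold simply mirror the union-style setting used later in the GIZA++/DUSTer experiment; the actual combiner may differ.

def combine_alignments(alignments, confidences, threshold):
    """alignments: one set of (i, j) word-pair links per aligner."""
    score = {}
    for links, conf in zip(alignments, confidences):
        for link in links:
            score[link] = score.get(link, 0.0) + conf
    return {link for link, s in score.items() if s >= threshold}

# With per-aligner confidence 1 and threshold 0.6, any single vote
# suffices, so the result is the union of the two aligners' outputs.
giza_links = {(0, 0), (1, 2), (2, 1)}
duster_links = {(1, 2), (3, 3)}
print(sorted(combine_alignments([giza_links, duster_links], [1.0, 1.0], 0.6)))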
Multi-Align has three advantages with respect to MT systems: ease of adaptability, robustness, and user control.
Ease of Adaptability: Multi-Align eliminates the need for complex modifications of pre-existing systems to incorporate new linguistic resources. A variety of different statistical and symbolic word alignment systems may be used together, such as statistical alignments [16], bilingual dictionaries acquired automatically using lexical correspondences [11,15], lists of closed-class words and cognates, syntactic and dependency trees on either side [2,21], phrase-based alignments [12] and linguistically-motivated alignments [5].
Robustness: Individual alignment systems have inherent deficiencies that result in partial alignments in their output. Multi-Align relies on the strengths of certain systems to compensate for the weaknesses of other systems.
User Control: The effect of different linguistic information is difficult to
observe and control when linguistic knowledge is injected into statistical
maximum entropy models [17]. Multi-Align avoids this problem by helping
users to understand which linguistic resources are useful for word alignment.
Additionally, the contribution of each aligner may be weighted according to
its impact on the target application.
3.1 Parameters
DUSTer's universal rules require certain types of words to be grouped together into parameter classes based on semantic-class knowledge, e.g., classes of verbs including Aspectual, Change of State, Directional, etc. [13]. The parameter classes play an important role in identifying and handling translation divergences. The current classification includes 16 classes of parameters. Because the parameters are based on semantic knowledge, the English values can be projected to their corresponding values in a new language, simply by translating the words into the other language. For example, the English light verbs be, do, give, have, make, put, and take map to the Spanish light verbs estar, ser, hacer, dar, tomar, poner, and tener.
Our feasibility experiment combines GIZA++ [16] and DUSTer. For generating GIZA++ alignments, we use the default GIZA++ parameters, i.e., the alignments are bootstrapped from Model 1 (five iterations), the HMM model (five iterations), Model 3 (two iterations) and Model 4 (four iterations).
GIZA++ and DUSTer are given as input alignment systems in the Multi-Align framework. Since DUSTer provides only partial alignments that are related to the translation divergences, we are interested in a union of these two systems' outputs to produce the final set of alignments. Therefore, the confidence values for each aligner are set to 1 and the confidence threshold is set to 0.6 (as a simplifying assumption, the feature function is taken to be a constant value; in future work, each alignment link will be weighted independently using different feature functions). Figure 5 shows the set of alignments generated by GIZA++ and DUSTer and their combination in Multi-Align for the sentence pair in our example.
4 Results
Table 1 summarizes the evaluation results for both sets. The differences between the Multi-Align and GIZA++ scores are statistically significant at a 95% confidence level using a two-tailed t-test for all three measures. (GIZA++ was trained on an English-Spanish training corpus of 45K sentence pairs.)
References
1. Peter F. Brown, Stephan A. Della-Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311, 1993.
2. Colin Cherry and Dekang Lin. A Probability Model to Improve Word Alignment.
In Proceedings of ACL 2003, pages 88–95, 2003.
Ralf D. Brown
Carnegie Mellon University Language Technologies Institute
5000 Forbes Avenue
Pittsburgh, PA 15213-3890 USA
ralf+@cs.cmu.edu
1 Introduction
A key component of any example-based or case-based machine translation system
is a good index of the training instances. For shallow EBMT systems such as
Gaijin [1], Brown’s Generalized EBMT [2], EDGAR [3], or Cicekli and Güvenir’s
Generalized EBMT [4], this index is formed from either the original training
text’s source-language half, or some lightly-transformed version thereof. This
paper addresses such a textual index and discusses how it not only impacts the speed of the system, but can also act as an enabler for additional capabilities which may improve the quality of translations.
Until recently, our EBMT system used an index based on an inverted file – a listing, for each distinct word (type), of all its occurrences (tokens) in the corpus. While such an index has several nice properties, including fast incremental updates that permit on-line training with additional examples, it scales poorly. Lookups not only take time linear in the amount of training text, they also require O(N) additional working memory, as maximal matches are built by finding adjacent instances of a pair of words and then extended one word at a
time. Although the code had been heavily optimized (including a five-instruction hand-coded inner scanning loop), the system's overall performance was not adequate for interactive applications with corpora exceeding about five million words.
The new index format which was selected as a replacement for the existing code is based on the Burrows-Wheeler Transform (BWT), a transformation of textual data first described in the context of data compression a decade ago [5], and now the underlying algorithm in many of the best-performing compression programs such as bzip2 [6]. The following sections give a brief overview of the BWT, how it was adapted for use in Carnegie Mellon University's EBMT system, the performance improvements that resulted from replacing the index, and new capabilities enabled by the BWT-based index.
1. Take the input text T (of length N) and make it row 0 of an N-by-N matrix M.
2. For i = 1 to N - 1, form row i of M by rotating T left by i places, as in Figure 1.
3. Sort the rows of M lexically, remembering where each row of the sorted result originated, to form M' (Figure 2).
4. Due to the manner in which M was constructed, columns 1 through N - 1 of M' can be reconstructed from column 0 by providing a pointer from each row to the row which had been immediately below it prior to sorting. This vector of successor pointers is conventionally called V.
5. Since the first column of M' now consists entirely of runs of equal elements in lexicographic order (at most one run per member of the alphabet), one can discard column 0 and represent it by an array C, of size equal to the alphabet, which contains the first row in M' containing each member of the alphabet.
In practice, neither M nor M' is ever explicitly constructed. Instead, an auxiliary array of pointers into T is sorted, with element comparisons determined by retrieving the corresponding tokens of T. T is usually a sequence of bytes, but for EBMT, a sequence of 32-bit word IDs was used.
Together, C and V are the output of the Burrows-Wheeler Transform. These two arrays lend themselves to highly effective data compression because the elements of C are monotonically increasing, as are the elements of V within each range C[w] to C[w+1] - 1 (see Figure 3). The transform is also reversible by starting at the position of V to which row 0 of M was sorted and following the successor links until all elements of V have been visited; this is the final step in the decompression phase of BWT-based compression algorithms.
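The minimal sketch below shows one way to compute C and V from a token sequence along the lines just described. It materialises the rotations for clarity, whereas a real implementation (as noted above) sorts an array of pointers into T; names and data are illustrative.

def build_bwt_index(tokens):
    n = len(tokens)
    # Row i of the conceptual matrix M is T rotated left by i places;
    # sorting the rotation indices gives the rows of M' without building M.
    order = sorted(range(n), key=lambda i: tokens[i:] + tokens[:i])
    # rank[i] = position of rotation i among the sorted rows of M'
    rank = [0] * n
    for pos, i in enumerate(order):
        rank[i] = pos
    # V[pos] points to the sorted position of the rotation that sat
    # immediately below rotation order[pos] before sorting (i.e. i + 1).
    V = [rank[(order[pos] + 1) % n] for pos in range(n)]
    # C[w] = first sorted row whose rotation starts with token w.
    C = {}
    for pos, i in enumerate(order):
        C.setdefault(tokens[i], pos)
    return C, V, order

# Tiny example over word tokens rather than bytes, as in the EBMT index.
C, V, order = build_bwt_index(["to", "be", "or", "not", "to", "be"])
print(C)
print(V)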
Given that the 32-bit IDs we used to represent both words and successor pointers in the V array provide more than 4,200 million possibilities, 1/4 of the possible values were reserved as EOR (end-of-record) markers. This sets the limit on the training corpus size at some 3,200 million words in over 1,000 million examples.
Further, because the EBMT system does not operate on text spanning a sentence boundary (though it allows for overlapping partial translations to be merged in generating the final translation [8]), there is no need in this application to determine what text from the next training example follows an EOR marker, and it is thus possible to omit the portion of V corresponding to EOR markers. As a result, the index is no larger than it would have been had the entire training corpus been treated as a single example. There is also no processing overhead from the EOR markers when finding matches.
In addition to inserting the EOR markers, the order of words in each training instance is reversed before adding them to the index. This allows matches to be extended from left to right even though lookups in the index can only be efficiently extended from right to left. (While the input sentence could be processed from right to left for EBMT, we also wished to use the same code to compute conditional probabilities for language modeling.)
To retrieve the training examples containing a phrase matching the input to be translated, iterate over the range of V corresponding to that phrase. For each instance of the phrase, follow the pointers in V until reaching an EOR marker, then extract the record number from the EOR marker and retrieve the appropriate training example from the corpus. The offset of the matched phrase within the example is determined by simply counting the number of pointers which were followed before reaching an EOR. The original source-language text is not included in
the stored example since it can be reconstructed from the index if required. We
opted not to store a pointer to the start position of the line in the V array, since
the only use our system has for the unmatched portion of the source text is for
display to the user in error messages, and we can forego display of the entire
source sentence in such cases. For applications which must be able to reconstruct
the entire source sentence or access adjacent sentences, separately-stored start
pointers will be required. Even without a line-start pointer, we are still able to
reconstruct the word sequence from the beginning of a training example to the
end of the matched portion when we need to display an error message; this is
typically the most useful part of the sentence for the user, anyway.
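A hedged sketch of this retrieval loop is shown below. The helper structures are assumptions made for illustration: eor_record maps a sorted row holding an EOR marker to its training-record number, corpus maps record numbers to stored examples, and the toy V array is likewise invented.

def retrieve(phrase_rows, V, eor_record, corpus):
    """phrase_rows: range of sorted rows whose (reversed) rotations
    start with the matched phrase."""
    hits = []
    for row in phrase_rows:
        steps = 0
        while row not in eor_record:   # follow successor pointers to the EOR
            row = V[row]
            steps += 1
        record = eor_record[row]
        # 'steps' counts the pointers followed before the EOR, locating
        # the matched phrase within the retrieved example.
        hits.append((record, steps, corpus[record]))
    return hits

# Toy data: two candidate rows, one EOR marker sitting at row 5.
V = [3, 4, 5, 5, 5, 5]
print(retrieve([0, 1], V, {5: 42}, {42: "stored target-language example"}))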
One drawback of the BWT-based index compared to the old inverted file is
that quick incremental updates are not possible, since the entire index needs
to be re-written, rather than chaining new records onto the existing occurrence
lists. Because on-line incremental updates in practice are performed only a small
number of times before there is an opportunity for an off-line update, we have
addressed the issue by using two indices – the large main index which was built
off-line when the system was trained, and a much smaller index containing only
the incremental updates. Although the entire auxiliary index must be re-written
on each incremental update, this can be done quickly due to its small size.
7 Speed as an Enabler
8 Future Work
References
11. Black, A.W., Brown, R.D., Frederking, R., Singh, R., Moody, J., Steinbrecher, E.:
TONGUES: Rapid Development of a Speech-to-Speech Translation System. In:
Proceedings of HLT-2002: Second International Conference on Human Language
Technology Research. (2002) 183-189
http://www.cs.cmu.edu/~ralf/Papers.html.
12. Graff, D., Cieri, C., Strassel, S., Martey, N.: The TDT-3 Text and Speech Corpus
(1999) http://www.ldc.upenn.edu/Papers/TDT1999/tdt3corpus.ps.
13. Bentley, J., Sedgewick, R.: Fast algorithms for sorting and searching strings. In:
SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theo-
retical and Experimental Analysis of Discrete Algorithms). (1997)
http://www.cs.princeton.edu/~rs/strings/.
Designing a Controlled Language for the Machine
Translation of Medical Protocols: The Case of English to
Chinese
Abstract. Because of its clarity and its simplified way of writing, controlled language (CL) is receiving increasing attention from NLP (natural language processing) researchers, for example in machine translation. The users of controlled
languages are of two types, firstly the authors of documents written in the
controlled language and secondly the end-user readers of the documents. As a
subset of natural language, controlled language restricts vocabulary, grammar,
and style for the purpose of reducing or eliminating both ambiguity and
complexity. The use of controlled language can help decrease the complexity of
natural language to a certain degree and thus improve the translation quality,
especially for the partial or total automatic translation of non-general purpose
texts, such as technical documents, manuals, instructions and medical reports.
Our focus is on the machine translation of medical protocols applied in the field
of zoonosis. In this article we will briefly introduce why controlled language is
preferred in our research work, what kind of benefits it will bring to our work
and how we could make use of this existing technique to facilitate our
translation tool.
1 Introduction
The idea of controlled language (CL) is not a new one. It can be traced back to as
early as the 1930s. It was first proposed for the purpose of encouraging a precise and
clarified way of writing for the better readability of technical documents by humans
(see for example [16]), both by native and non-native speakers, and especially of
English [15]. Thus the users of controlled languages are of two types, firstly the
authors of documents written in the controlled language and secondly the end-user
readers of the documents. During its development, different names have been used for
the English variants, such as “Simplified English”, “Plain English”, “Basic English”
and “Global English”. Though there are some differences among each of these, in this
article we prefer to put them into one category – controlled language, an artificially
defined subset of natural language (here English) as they are all more or less
controlled and share more similarities. In practice, controlled languages are divided into two major categories according to their purpose: 1) better readability for humans and 2) ease of natural language processing (NLP) [8]. Our present work focuses on the second category, that
is, natural language processing, and in particular, machine translation. We intend to
design and develop a CL to control the generation of feasible and faithful texts of
medical protocols which have been written by professionals, one of the two types of users of the CL; this is done by means of machine translation for the end-users, the other type of CL user, namely professional or non-professional medical workers working in the field of zoonosis. There are a good many CL systems in existence around the globe. Among these, those frequently cited include AECMA Simplified English, Caterpillar's CTE, CMU's KANT system, General Motors' CASL and LantMark, etc. All these different systems have made great progress in advancing the research and practice in controlled languages, and this makes it possible for us to benefit from them.
are encountered. These differences produce problems both at the vocabulary level and
the grammatical level. However, most of these problems can be controlled or at least
can be made much less complex than those appearing in other (uncontrolled)
documents. In addition, no matter how different the writing style might be, the
general writing method is similar. The usual writing style is to present the step-by-step procedures as a list, one by one. Even if the procedures are not well listed or their
sentence structures are confused or complex, we can always easily control them.
Finally, controlled language is particularly suitable for multilingual translation if
all the languages concerned can be controlled more or less in the same way; for other
work in this area see for example [11], [13]. While the work described in this paper is
concerned with machine translation of English to Chinese, the same domain (medical
protocols applied in the field of zoonosis) is being studied in respect of machine
translation from French to Arabic. Using controlled language can greatly facilitate the
eventual generation and combination of these four controlled languages. These are the
very reasons why we finally chose to apply CL techniques in our research.
In this section, we will briefly introduce a few of the major differences between
English and Chinese concerned specifically with our work. English and Chinese
belong to different language families. The differences between these two languages
can be found at almost all levels, lexical, syntactic and semantic.
Firstly, unlike most Indo-European languages, Chinese characters do not exhibit
morphological changes to show different grammatical functions. In terms of writing,
there are no separators between the words. This makes Chinese word segmentation
extremely difficult. For the words formed by a Chinese character or characters, it is
hard to tell their grammatical categories only from their morphological forms.
Traditionally, in Chinese dictionaries the grammatical category of a word is not
indicated, and this is for many reasons, one of which is the lack of agreement
concerning the categorization of Chinese word classes. The grammatical category of a
Chinese word can only be distinguished from its place or order in a sentence. Though
word order is also important in English, we can still tell the grammatical category of
most English words from their morphological form out of the sentence context
without much difficulty (though the presence of grammatical category ambiguity, its
recognition and disambiguation is still problematic in NLP). This is usually not the
case for Chinese words. For example, a Chinese word can have the function of a
noun, a verb or even an adjective without any morphological change. Verbs and
adjectives can appear directly as subject or object (usually no matter what constituent
a word is, its morphological form will always stay the same). The person and number
of the subject do not affect the verb at all. Besides, Chinese words do not have
inflexions to mark the number of nouns, as English nouns have, or inflexions to show different tenses and other grammatical functions, as English verbs do.
Thus word order becomes more important in Chinese. In order to convey the
meanings of some of these different grammatical functions, Chinese uses many
auxiliary words, such as structural, interrogative, and aspectual words (also called
particles). For example, it is difficult to say exactly what the grammatical category of
a Chinese word is, or even what the word refers to, without any context, as is shown in the following sentences [17]:
Unlike English, in Chinese the verb functions as subject without needing to change its form, while the English verb swim has to add -ing in order to get a grammatically legal form.
Secondly, there are syntactic differences between the two languages, both at the
phrasal level and sentence level. In English the attributives can be put both in front of
and behind the noun, but in Chinese they are usually in front of the noun, including
some of the relative clauses if these are not translated into several sentences. In our
current study, a large proportion of sentences belong to these two kinds. For example,
5. experimental evidence
6. the use of antibacterial agents
7. patients who refuse surgery.
8. patients who relapse after surgery.
Example 5 shares the same word order in both languages, though not necessarily the same grammatical categories: the first word "experimental" is an adjective in English, but its counterpart is a noun in Chinese. These kinds of phrases are the easiest to deal with. Another important point is that when English short relative clauses are translated into Chinese, they are usually translated into phrases instead of subordinate clauses, as in examples 7 and 8. But for some long and complex relative clauses the common way is to translate them into two or more sentences.
Thirdly, for simple declarative sentences, while English sentences are more
dominated by SVO structure, in Chinese there are often at least three possibilities:
SVO, SOV, OSV (the passive voice is excluded).
9. The doctor examined the patient.
Like most other CL systems, while designing and developing our controlled language,
we concentrate on two aspects: vocabulary and grammar (as style does not greatly
vary, it is not within our major focus). Thus both the source (English) and target
(Chinese) languages will be controlled in these two aspects. Furthermore, in order to
ease the translation from source language to target language, the most similar
structures in both languages are to be preferred; that is, to try to avoid structures
which are difficult to translate or can produce some language-specific ambiguities in
the two languages. As we demonstrated in the previous section, although Chinese
allows SVO, SOV and OSV structures, the accepted structure will be SVO,
conforming to that of the source language, English. The phrases and sentences that are
structurally different from source language (SL) to target language (TL) have to be
specified.
There are three major principles we follow: consistency, clarity and simplicity.
Briefly speaking, consistency means that all the terminology and writing rules should
be consistently used. There should not be any conflicts between the terms and
between the rules. Clarity means that texts to be written should conform to the
formulated rules. If any part of a text does not fit the rules, it should be checked and
clarified. Simplicity means that while there are alternative ways of saying the same
thing, the simplest one and that which is most accountable by our rules should be
selected [9]. The simplest writing style is preferred. All these principles should be
followed in both the source language and in the target language.
4.1 Vocabulary
performed on a patient”. This means that in our lexicon, the common meaning of a
word will sometimes be excluded.
Thirdly, in our corpus, many acronyms are frequently used. In order to avoid
possible ambiguity, the period within and at the end of abbreviations will not be
allowed. That is, acronyms will be written like other ordinary words, whether or not they are capitalized. For example, CT is recommended instead of C.T.; ELISA,
instead of E.L.I.S.A.; also in “10-15 cc” and in “ABP 90-50 mm Hg”, no period will
be used. Acronyms will be treated as single entries. In addition, their meanings will be
constrained in a domain-specific manner.
For example, in our lexicon US refers to ultrasound and not the United States.
Acronyms can be transferred directly without being translated into Chinese (which is common practice) or transferred directly together with the Chinese equivalent in parentheses.
Fourthly, we have to standardize the orthography of some of the terms. We have
observed in our corpus that many terms have more than one spelling form. The most
frequently occurring are those words ending in either -cal or -ic. For example, we find serological or serologic, and immunological or immunologic. Other variant orthographic forms are, for example, onchocercosis and onchocerciasis (which of the English words is preferable is still a matter of debate between specialists in English-speaking countries and specialists in non-English-speaking countries); lymphopaenia and lymphopenia; and leukopaenia and leukopenia (British English and American English respectively). This kind of phenomenon is mainly caused by different habits or by
respectively). This kind of phenomenon is mainly caused by different habits or by
geographical distinctions. In our lexicon only one form will be allowed, for example,
in the case of –cal and –ic, -cal will be recommended. The others will follow the most
commonly used one found in authorized dictionaries. The standardization of
terminology is necessary, but this work has to be done with the involvement of field
specialists. We have thus worked with them for the standardization of terminology,
including examining the use of jargon. Furthermore, in the longer term, for the
eventual machine translation of such controlled protocols, entry of such protocols
involving the choice of the correct spelling should be supported either by on-line
vocabulary checking or by an integrated vocabulary checker.
Finally, in our corpus, figures are frequently used to indicate for example the
quantity of drugs. We thus have to pay special attention to the use of figures.
4.2 Grammar
At the level of grammar, we have two things to do: to control at the phrasal level and
to control at the sentential level. We have carefully examined our corpus, and we have
observed that no matter what kind of style the author employs, the grammatical
structures do not vary greatly. In fact, by themselves they are very limited. Moreover, the present tense is the most commonly used, and the passive voice is not used very frequently.
5 Future Work
In order not to frustrate authors (one of the two types of users of the CL) with the controlled rules, and to facilitate their production of texts, we need to develop appropriate language checkers [3] to enhance the usefulness of our CL in practice. For example, a
vocabulary checker can be used to check terms, forms and/or acronyms, of which the
latter is particularly important. A grammar checker can be used to check the sentence
or phrase lengths as well as some lexical and grammatical misuses or unacceptable
sentences. A benefit will be that authors can learn the writing rules while they are
using our tool without being trained specifically in the controlled language. That is,
while authors are writing their texts, whenever they enter a word or phrase, the
following potential structure will be suggested directly. The system will automatically
direct the authors and furthermore will check for grammatical errors when any input
is found which does not comply with our controlled rules. Additionally an electronic
dictionary will support the checking and correction of the orthography and also will
aid in finding synonyms.
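A minimal sketch of such a vocabulary checker is given below, assuming a small hand-built table of preferred forms; the terms, spellings and acronym handling are illustrative examples drawn from Section 4.1, not the project's actual lexicon or code.

import re

PREFERRED = {
    "serologic": "serological",       # prefer -cal over -ic
    "immunologic": "immunological",
    "onchocerciasis": "onchocercosis",
    "lymphopaenia": "lymphopenia",
    "leukopaenia": "leukopenia",
}

def check_token(token):
    """Return (ok, suggestion) for a single token."""
    # Acronyms: strip internal/trailing periods (C.T. -> CT, E.L.I.S.A. -> ELISA).
    if re.fullmatch(r"(?:[A-Za-z]\.)+", token):
        return False, token.replace(".", "").upper()
    # Variant spellings: suggest the single form allowed in the lexicon.
    if token.lower() in PREFERRED:
        return False, PREFERRED[token.lower()]
    return True, token

for tok in ["C.T.", "serologic", "ELISA", "lymphopaenia"]:
    print(tok, "->", check_token(tok))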
6 Conclusion
This article has discussed the characteristics of both the natural languages concerned
in our study, English and Chinese, the typical features of medical protocols and the
reasons why we prefer to use controlled language as a basic technique for the machine
translation of medical protocols. We have stated that the benefits of controlled
languages and in particular that are easier to deal with for translation purposes. It is
true that our study can only cover a very small part of the linguistic phenomena
involved and will not be able to deal with the many kinds of linguistic problems
outside the focus of our work. However, what we will have done during this first step
will surely help broaden the future coverage of linguistic phenomena and promote
further research. The grammatical structures we have defined can be applied to
multilingual environments. As they are simple, they consist of the basic language
structures found in all languages. Furthermore, such grammatical structures are thus
easier to analyze and transfer from language to language. So, having designed
a CL we can then finally turn our attention to a multilingual environment for machine
translation applied to four different and widely used languages, namely Arabic,
Chinese, English and French and the particular translation couples we are involved
with (English – Chinese, French – Chinese, French – Arabic).
References
[1] Alsharaf, H., Cardey, S., Greenfield, P., Shen, Y., “Problems and Solutions in Machine
Translation Involving Arabic, Chinese and French”, Actes de 1’International Conference
on Information Technology, ITCC 2004, April 5-7, 2004, Las Vegas, Nevada, USA,
IEEE Computer Society, Vol 2, pp.293-297.
[2] Alsharaf, H., Cardey, S., Greenfield, P., “French to Arabic Machine Translation: the
Specificity of Language Couples”, Actes de The European Association for Machine
Translation (EAMT) Ninth Workshop, Malta, 26-27 April 2004, Foundation for
International Studies, University of Malta, pp.11-17.
[3] Altwarg, R., "Controlled Languages: An Introduction"
http://www.shlrc.mq.edu.au/masters/students/raltwarg/
[4] Cardey, S., Greenfield, P. “Peut-on séparer lexique, syntaxe, sémantique en traitement
automatique des langues ?”. In Cahiers de lexicologie 71 1997, ISSN 007-9871, 37-51.
[5] Cardey, S., Greenfield, P., “Systemic Language Analysis and Processing”, To appear in
the Proceedings of the International Conference on Language, Brain and Computation, 3-5 October 2002, Venice, Italy, Benjamins (2004).
[6] Cardey S. “Traitement algorithmique de la grammaire normative du français pour une
utilisation automatique et didactique", Thèse de Doctorat d'Etat, Université de Franche-Comté, France, June 1987.
[7] Cardey, S., Greenfield, P., Hong, M-S., “The TACT machine translation system:
problems and solutions for the pair Korean – French”, Translation Quarterly, No. 27, The
Hong Kong Translation Society, Hong Kong, 2003, pp. 22–44.
[8] Namahn, “Controlled Languages, a research note by Namahn”
http://www.namahn.com/resources/documents/note-CL.pdf
[9] Olu Tomori S. H., “The Morphology and Syntax of Present-day English: An
Introduction”, London, HEINEMANN Educational Books Ltd, 1997.
Designing a Controlled Language for the Machine Translation of Medical Protocols 47
[10] Ronald A. Cole et al, Survey of the State of the Art in Human Language Technology -
Section 7.6 (1996), http://cslu.cse.ogi.edu/HLTsurvey/ch7node8.html
[11] Sågvall-Hein, A., “Language Control and Machine Translation”, Proceedings of the
International Conference on Theoretical and Methodological Issues in Machine
Translation (MI- 97), Santa Fe, 1997.
[12] Shen, Y., Cardey, S., “Vers un traitement du groupe nominal dans la traduction
automatique chinois-français”, Ve Congrès International de Traduction, Barcelona, 29-
31 October 2001.
[13] Teruko Mitamura “Controlled Language for multilingual Machine Translation” In
Proceedings of Machine Translation Summit VII, Singapore, September 13-17, 1999
http://www.lti.cs.cmu.edu/Research/Kant/PDF/MTSummit99.pdf
[14] Tenni, J. et al. “Machine Learning of Language Translation Rules”
‘http://www.vtt.fi/tte/language/publications/smc99.pdf
[15] Simplified English, userlab Inc. 1-800-295-6354 (in North America)
http://www.userlab.com/SE.html
[16] Van der Eijk, P., “Controlled Languages and Technical Documentation”, Report of Cap
Gemini ATS, Utrecht, 1998.
[17] Century Publishing Group of Shanghai 2003
Normalizing German and English Inflectional
Morphology to Improve Statistical Word Alignment
1 Introduction
The task of statistical word alignment is to identify the word correspondences that
obtain between a sentence in a source language and the translation of that sentence
into a target language. Of course, fluent translation performed by expert human
translators involves reformulation that obscures word alignment. However, in many
domains, automatically identified word alignments serve as an important source of
knowledge for machine translation.
We describe a series of experiments in which we apply morphological
normalizations to both the source and target language before computing statistical
word alignments. We consider the case of aligning English and German, two closely
related languages that differ typologically in ways that are problematic for current
statistical approaches to word alignment.
We perform a series of experiments using the Giza++ toolkit (Och and Ney, 2001).
The toolkit provides an implementation of IBM Model 1 and Model 4 (Brown et al.,
1993) as well as an HMM-based alignment model (Vogel, Ney and Tillman, 1996),
together with useful metrics of model perplexity. We perform five iterations of IBM
Model 1, which attempts to find simple word translations without consideration of the
position of the words within the sentence. The word-alignment hypotheses yielded by
this first stage serve as input for five iterations of HMM alignments, which in turn
serve as input for five iterations of IBM Model 4. Model 4, which models phenomena
such as the relative order of a head and a modifier, is the most sophisticated model
considered here. Clustering of words was performed using JCLUSTER (Goodman,
2001).
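For readers unfamiliar with the first stage of this pipeline, the following is a minimal Python sketch of the EM procedure underlying IBM Model 1, which estimates word-translation probabilities while ignoring word positions; it is an illustration of the published model (Brown et al., 1993), not the Giza++ implementation, and the toy bitext is invented.

from collections import defaultdict

def ibm_model1(bitext, iterations=5):
    """Estimate t(f | e), the probability of target word f given source word e.
    Each pair in bitext is (source_tokens, target_tokens); a NULL source token
    lets target words align to nothing."""
    bitext = [(["NULL"] + e, f) for e, f in bitext]
    t = defaultdict(lambda: 1.0)              # effectively uniform at the start
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts c(e, f)
        total = defaultdict(float)            # expected counts c(e)
        for e_sent, f_sent in bitext:
            for f in f_sent:
                z = sum(t[(e, f)] for e in e_sent)        # normalisation term
                for e in e_sent:
                    p = t[(e, f)] / z                     # posterior of the link e -> f
                    count[(e, f)] += p
                    total[e] += p
        for (e, f), c in count.items():                   # M-step: re-estimate t
            t[(e, f)] = c / total[e]
    return dict(t)

pairs = [(["the", "house"], ["das", "Haus"]),             # toy, invented bitext
         (["the", "book"], ["das", "Buch"])]
translation_table = ibm_model1(pairs, iterations=5)

The HMM and Model 4 stages then add increasingly rich models of word position on top of translation tables initialised in this way.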
These word alignment models take a naïve view of linguistic encoding. Sentences
are conceived of as little more than a sequence of words, mapped one-to-one or one-
to-N from the source language to the target language. Recent research has attempted
2 Morphological Facts
English and German are historically related; both languages are in the Western branch
of the Germanic family of Indo-European. Despite this close historical relation, the
modern-day languages differ typologically in ways that are problematic for statistical
approaches to word alignment.
German has pervasive productive noun-compounding. English displays its
Germanic roots in the analogous phenomenon of the noun group—sequences of nouns
with no indication of syntactic or semantic connection. As a general rule, English
noun groups translate in German as noun compounds. The converse does not always
obtain; German compounds occasionally translate as simple English nouns, other
times as nouns with prepositional, adjectival, or participial modifiers. When using
models such as those of Brown et al. (1993), which allow one-to-one or one-to-N
alignments, we would expect this asymmetry to result in poor alignment when
English is the source language and German is the target language.
The order of constituents within the clause is considerably more variable in
German and long distance dependencies such as relative clause extraposition are more
common than in English (Gamon et al., 2002). In German, so-called separable verb
prefixes may occur bound to a verb or may detach and occur in long distance
relationships to the verb. Adding to the confusion, many of these separable prefixes
are homographic with prepositions.
The languages differ greatly in the richness of their inflectional morphologies.
Both languages make a three way distinction in degree of adjectives and adverbs. In
nominal inflections, however, English makes only a two way distinction in number
(singular vs. plural) whereas German makes a two way distinction in number
(singular and plural), a four way distinction in grammatical case (nominative,
accusative, genitive and dative) and a three way distinction in lexical gender
(masculine, feminine, neuter). Nominal case is realized in the German noun phrase on
the noun, the determiner and/or pre-nominal modifiers such as adjectives. Vestiges of
this case marking remain in the English pronominal system, e.g. I/me/my.
The languages have similar systems of tense, mood and aspect. Verbal inflection
distinguishes past versus non-past, with weak vestiges of an erstwhile distinction
between subjunctive and indicative mood. Many complexes of tense, aspect and mood
are formed periphrastically. The most notable difference between the two languages
occurs in the morphological marking of person and number of the verb. Aside from
the irregular verb be, English distinguishes only third-person singular versus non-
third-person singular. German on the other hand distinguishes first, second and third
person by means of inflectional suffixes on the verb. In the data considered here,
drawn from technical manuals, first and second person inflections are extremely
uncommon.
Let us now consider how these linguistic facts pose a problem for statistical word
alignment. As previously noted, the correspondence between an English noun group
and a German noun compound gives rise to an N-to-one mapping, which the IBM
models do not allow. Differences in constituent order, however, are really only a
problem when decoding, i.e. when applying a statistical machine translation system: it
is difficult to model the movement of whole constituents by means of distortions of
words.
The homography of separable prefixes and prepositions adds interference when
attempting word alignment.
The most glaring deficiency of the IBM models in the face of the linguistic facts
presented above concerns related word forms. The models do not recognize that some
words are alternate forms of other words, as opposed to distinct lexical items. To put
this another way, the models conflate two problems: the selection of the appropriate
lexical item and the selection of the appropriate form, given the lexical item.
Since the models do not recognize related word forms, the effect of inflectional
morphology is to fragment the data, resulting in probability mass being inadvertently
smeared across related forms. Furthermore, as Och and Ney (2003) observe, in
languages with rich morphology, a corpus is likely to contain many inflected forms
that occur only once. We might expect that these problems could be resolved by using
more training data. Even if this were true in principle, in practice aligned sentences
are difficult to obtain, particularly for specific domains or for certain language pairs.
We seek a method for extracting more information from limited data using modest
amounts of linguistic processing.
With this brief formulation of the problem, we can now contrast the morphological
operations of this paper with Nießen and Ney (2000), who also consider the case of
German-English word alignment. Nießen and Ney perform a series of morphological
operations on the German text. They reattach separated verbal prefixes to the verb,
split compounds into their constituents, annotate a handful of high-frequency function
words for part of speech, treat multiword phrases as units, and regularize words not
seen in training. The cumulative effect of these linguistic operations is to reduce the
subjective sentence error rate by approximately 11-12% in two domains.
Nießen and Ney (2004) describe results from experiments where sentence-level
restructuring transformations such as the ones in Nießen and Ney (2000) are
combined with hierarchical lexicon models based on equivalence classes of words.
4 Data
5 Results
We perform stemming on the English and German text using the NLPWin analysis
system (Heidorn, 2000). In the discussion below we consider the perplexity of the
models, and word error rates measured against a gold standard set of one hundred
manually aligned sentences that were sampled uniformly from the data.
The stemmers for English and German are knowledge-engineered components. To
evaluate the accuracy of the stemming components, we examined the output of the
stemmer for each language when applied to the gold standard set of one hundred
sentences. We classified the stems produced as good or bad in the context of the
sentence, focusing only on those stems that actually changed form or that ought to
have changed form. Cases where the resulting stem was the same as the input, e.g.
English prepositions or singular nouns or German nouns occurring in the nominative
singular, were ignored. Cases that ought to have been stemmed but which were not in
fact stemmed were counted as errors.
The English file contained 1,489 tokens; the German analogue contained 1,561
tokens.1 As Table 2 shows, the effects of the morphological processing were
overwhelmingly positive. In the English test set there were 262 morphological
normalizations, i.e. 17.6% of the tokens were normalized. In German, there were 576
normalizations, i.e. 36.9% of the tokens were normalized. Table 3 presents a
breakdown of the errors encountered. The miscellaneous category indicates places
where unusual tokens such as non-breaking spaces were replaced with actual words,
an artifact of tokenization in the NLPWin system. Compared to Table 1,
morphological normalization reduces the number of singletons in German by 17.2%
and in English by 7.8%.
1 Punctuation other than white space is counted as a token. Throughout this paper, the term
“word alignment” should be interpreted to also include alignments of punctuation symbols.
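The singleton figures above can in principle be reproduced with a routine along the following lines; this is only a sketch, and the stem callable stands in for the knowledge-engineered NLPWin stemmers, which are not reproduced here.

from collections import Counter

def singleton_count(tokens):
    """Number of word types occurring exactly once (singletons)."""
    return sum(1 for n in Counter(tokens).values() if n == 1)

def singleton_reduction(tokens, stem):
    """Relative reduction in singletons after stemming every token.
    stem is any callable mapping a surface form to its citation form."""
    before = singleton_count(tokens)
    after = singleton_count([stem(tok) for tok in tokens])
    return (before - after) / before if before else 0.0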
In the remainder of this paper we will, for the sake of convenience, refer to this
differential perplexity simply as “perplexity”.
We compute word alignments from English to German and from German to
English, comparing four scenarios: None, Full, NP and Verb. The “None” scenario
establishes the baseline if stemming is not performed. The “Verb” scenario performs
stemming only on verbs and auxiliaries. The “NP” scenario performs stemming only
on elements of the noun phrase such as nouns, pronouns, adjectives and determiners.
The “Full” scenario reduces all words to their citation forms, applying to verbs,
auxiliaries, and elements of the noun phrase as well as to any additional inflected
forms such as adverbs inflected for degree. We remind the reader that even in the
scenario labeled “None” we break contractions into their component parts. The results
of stemming are presented in Figure 1 and Figure 2. For ease of exposition, the axes
in the two figures are oriented so that improvements (i.e. reductions) in perplexity
correspond to bars projected above the baseline. Bars projected below the baseline
have a black top at the point where they meet the base plane. The base plane indicates
the model perplexity when no stemming is performed in either language.
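A minimal sketch of how the four scenarios might be applied as a preprocessing step is given below; the part-of-speech tag set, the category groupings and the stem callable are assumptions made for illustration and do not correspond to the NLPWin components actually used.

NP_TAGS = {"NOUN", "PRON", "ADJ", "DET"}      # hypothetical grouping for the "NP" scenario
VERB_TAGS = {"VERB", "AUX"}                   # hypothetical grouping for the "Verb" scenario

def normalize(tagged_tokens, scenario, stem):
    """tagged_tokens: list of (surface, pos) pairs for one sentence.
    scenario: one of "None", "Verb", "NP" or "Full".
    stem: callable mapping a surface form to its citation form."""
    out = []
    for surface, pos in tagged_tokens:
        if scenario == "Full":
            out.append(stem(surface))
        elif scenario == "Verb" and pos in VERB_TAGS:
            out.append(stem(surface))
        elif scenario == "NP" and pos in NP_TAGS:
            out.append(stem(surface))
        else:
            out.append(surface)               # "None", or a POS outside the scenario
    return out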
As Figure 1 illustrates, E-G perplexity is improved across the board if stemming is
performed on the target language (German). If no stemming is done on the German
side, stemming on the source language (English) worsens perplexity. Interestingly, the
stemming of German verbs causes the largest improvements across all English
stemming scenarios.
Figure 2 shows a remarkably different picture. If English is the target language,
any stemming on either the German source or the English target yields worse
perplexity results than not stemming at all, with the exception of tiny improvements
when full stemming or verb stemming is performed on English.
The difference between the two graphs can be interpreted quite easily: when the
target language makes fewer distinctions than the source language, it is easier to
model the target probability than when the target language makes more distinctions
than the source language. This is because a normalized term in the source language
will have to align to multiple un-normalized words in the target across the corpus,
smearing the probability mass.
In order to assess the impact of these morphological operations on word alignment
we manually annotated two sets of reference data. In one set of reference data, no
stemming had been performed for either language. In the other set, full stemming had
been applied to both languages. The manual annotation consisted of indicating word
alignments that were required or permissible (Och and Ney 2000, 2003). We then
evaluated the alignments produced by Giza++ for these sentence pairs against the
manually annotated gold standard measuring precision, recall and alignment error rate
(AER) (Och and Ney 2003). Let A be the set of alignments produced by Giza++, S be
the set of sure (i.e. required) alignments and P the set of possible alignments. The
definition of precision, recall and AER is then:
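The formulas themselves fall outside this extract; following the definitions given by Och and Ney (2003), with S a subset of P, they are:

\[
\mathrm{precision} = \frac{|A \cap P|}{|A|}, \qquad
\mathrm{recall} = \frac{|A \cap S|}{|S|}, \qquad
\mathrm{AER}(S, P; A) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}.
\]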
The results are presented in Table 4. Full stemming improves precision by 3.5%
and recall by 7.6%. The alignment error rate is reduced from 20.63% to 16.16%, a
relative reduction of 21.67%.
Note that the alignment error rates in Table 4 are much larger than the ones
reported in Och and Ney (2003) for the English-German Verbmobil corpus. For the
closest analogue of the Giza++ settings that we use, Och and Ney report an AER of
6.5%. This discrepancy is not surprising, however: Our corpus has approximately
three times as many words as the Verbmobil corpus, more than ten times as many
singletons and a vocabulary that is nine times larger.
6 Discussion
As noted above, the morphological operations that we perform are the complement of
those that Nießen and Ney (2000) perform. In future research we intend to combine
stemming, which we have demonstrated improves statistical word alignment, with the
operations that Nießen and Ney perform. We expect that the effect of combining these
morphological operations will be additive.
Additional work remains before the improved word alignments can be applied in
an end-to-end statistical machine translation system. It would be most unsatisfactory
to present German readers, for example, with only the citation forms of words. Now
that we have improved word choice, we must find a way to select the
contextually appropriate word form. In many instances in German, the selection of
word form follows from other observable properties of the sentence. For example,
prepositions govern certain cases and verbs agree with their subjects. One avenue
might be to apply a transformation-based learning approach (Brill, 1995) to selecting
the correct contextual variant of a word in the target language given cues from the
surrounding context or from the source language.
Acknowledgements. Our thanks go to Chris Quirk and Chris Brockett for technical
assistance with Giza++, and to Ciprian Chelba and Eric Ringger for discussions
regarding the normalization of perplexity.
References
Alshawi, Hiyan, Shona Douglas, Srinivas Bangalore. 2000. Learning dependency translation
models as collections of finite-state head transducers. Computational Linguistics 26(1):
45-60.
System Description: A Highly Interactive Speech-to-Speech Translation System
M. Dillinger and M. Seligman
First, users can monitor and correct the speaker-dependent speech recognition system
to ensure that the text which will be passed to the machine translation component is
completely correct. Voice commands (e.g. Scratch That or Correct <incorrect
text>) can be used to repair speech recognition errors. While these commands are
similar in appearance to those of IBM’s ViaVoice or ScanSoft’s Dragon
NaturallySpeaking dictation systems, they are unique in that they remain usable even
when speech recognition operates at a server. Thus they provide for the first time the
capability to interactively confirm or correct wide-ranging text which is dictated from
anywhere.
Next, during the MT stage, users can monitor, and if necessary correct, one espe-
cially important aspect of the translation – lexical disambiguation.
The problem of determining the correct sense of input words has plagued the ma-
chine translation field since its inception. In many cases, the correct sense of a given
term is in fact available in the system with an appropriate translation, but for one
reason or another it does not appear in the output. Word-sense disambiguation algo-
rithms being developed by research groups have made significant progress, but still
often fail; and the most successful still have not been integrated into commercial MT
systems. Thus no really reliable solution for automatic word-sense disambiguation is
on the horizon for the short and medium term.
The result is an utterance which has been monitored and perhaps repaired by the
user at two levels – those of speech recognition and translation. By employing these
interactive techniques while integrating state-of-the-art dictation and machine trans-
lation programs – we work with Philips Speech Processing for speech recognition;
with Word Magic and Lingenio for MT (for Spanish and German, respectively); and
with ScanSoft for text-to-speech – we have been able to build the first commercial-
grade speech-to-speech translation system which can achieve broad coverage without
sacrificing accuracy.
1 A Usage Example
When run on a Motion Computing Tablet PC, the system has four input modes:
speech, typing, handwriting, and touch screen. To illustrate the use of interactive
correction for speech recognition, we will assume that the user has clicked on the
microphone icon onscreen to begin entering text by speaking. The image below
shows the preliminary results after pronunciation of the sentence “The old man sat on
the bank”.
Fig. 1.
The results of automatic speech recognition are good, but often imperfect. In this
example, “man” was incorrectly transcribed as “band” (Fig. 1.). Accordingly, the user
can perform voice-activated correction by saying “Correct band”. A list of alternative
speech recognition candidates then appears, as seen in the image below. The user can
select the correct alternative in this case by saying “Choose one”, yielding a corrected
sentence. (If the intended alternative is not among the candidates, the user can supply
it manually – by typing on a standard keyboard, by using a touch screen keyboard, or
by writing with a stylus for high-accuracy handwriting recognition.)
The spoken (or clicked) “Translate” command (Fig. 2.) provides a translation of the
corrected input, seen below in the Translation window (Fig. 3.). Also provided are a
Back Translation (the translated sentence re-translated back into the original, as ex-
plained above) and an array of Meaning Cues giving information about the word
meanings that were used to perform the translation, seen in the Word Meanings list.
The user can use these cues to verify that the system has interpreted the input as in-
tended.
Fig. 2.
Fig. 3.
In this example (Fig. 3.), synonyms are used as Meaning Cues, but definitions, exam-
ples, and associated words can also be shown. Here the back-translation (“The old
man took a seat in the row”) indicates that the system has understood “bank” as
meaning “row”. Presumably, this is not what the user intended. By clicking on the
word in the Word Meanings window, he or she can bring up a list of alternative word
meanings, as in the image below.
Fig. 4.
When a new word meaning has been chosen from this list, e.g. the “riverbank”
meaning in this case, the system updates the display in all windows to reflect that
change (Fig. 4.). In this example, the updated Spanish translation becomes “El hom-
bre viejo se sentó en la orilla del río”. The corresponding back translation is now
“The old man took a seat at the bank of the river” – close enough, we can assume, to
the intended meaning.
When the user is satisfied that the intended meaning has been correctly understood
and translated by the system, the system’s Send button can be used to transmit the
translation to the foreign-language speaker via instant messaging, chat, or on-screen
display for face-to-face interaction. At the same time, synthesized speech can be gen-
erated, and if necessary transmitted, thus completing the speech-to-speech cycle.
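The interaction just described amounts to a confirm-or-correct loop at two levels. The sketch below is purely illustrative of that control flow; the function names stand in for the commercial recognition, translation and synthesis components named earlier and do not reflect any actual API.

def interactive_translate(recognize, translate, back_translate, synthesize, user):
    """Hypothetical control flow of the monitored speech-to-speech cycle."""
    text = recognize()                          # speech recognition hypothesis
    while not user.confirms(text):              # "Correct <text>", "Choose one", or manual edit
        text = user.repair(text)

    senses = None
    result = translate(text, senses)            # translation plus Meaning Cues
    while not user.confirms(back_translate(result)):
        senses = user.choose_senses(result)     # e.g. pick "riverbank" for "bank"
        result = translate(text, senses)        # re-translate with the chosen senses

    synthesize(result)                          # speak and/or transmit the result
    return result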
2 Languages
The current version of the system is for English <> Spanish, and a German <> Eng-
lish version is in development. Discussions are also in progress with several vendors
of Japanese MT.
3 Implementation
A Fluency Error Categorization Scheme
D. Elliott, A. Hartley, and E. Atwell
School of Computing and Centre for Translation Studies, University of Leeds, LS2 9JT, UK
{debe,eric}@comp.leeds.ac.uk, a.hartley@leeds.ac.uk
1 Introduction
Automated machine translation evaluation is quicker and cheaper than obtaining hu-
man judgments on translation quality. However, automated methods are ultimately
validated by the establishment of correlations with human scores. Overviews of both
human and automated methods for MT evaluation can be found in [1] and on the
FEMTI 1 website [2]. Although existing automated methods such as BLEU [3] and
RED [4] can produce scores that correlate with human quality judgments, these meth-
ods still require human translations, which are expensive to produce. BLEU requires
up to four human ‘reference’ translations against which MT output is automatically
compared and scored according to modified n-gram precision. The test corpus used
for this research comprised 500 sentences from general news stories, with four human
translations of each. RED, on the other hand, automatically ranks MT output based on
edit distances to multiple reference translations. In [5], 16 human reference transla-
tions of 345 sentences in two language directions were used from the Basic Travel
Expression Corpus [6].
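For concreteness, the following minimal Python sketch illustrates the clipped (modified) n-gram precision that underlies BLEU, in which candidate n-gram counts are limited by the maximum count observed in any reference; it is an illustration of the published method [3], not the code used in the studies cited here.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against multiple references."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, cnt in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0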
To eliminate the expense of producing human translations, and to investigate the
potential of a more portable method, our aim is to design an automated MT evaluation
system, initially for language pairs in which the target language is English, which
does not require human reference translations. The system will detect fluency errors
1 A Framework for the Evaluation of Machine Translation in ISLE.
Many published MT evaluation projects, such as BLEU [3] and the DARPA evaluation
series [7], have based their research entirely on newspaper texts. Many subsequent
MT evaluation experiments have also made use of the DARPA corpus, such as [8],
[9], [10], [11], [12] and [13]. Consequently, we conducted a survey of MT users in
2003 to find out which text types were most frequently translated by MT systems. Re-
sponses showed a great difference between the use of MT by companies/organizations
and by individuals who machine translated documents for personal use [14]. It was
found that companies most frequently machine translated user manuals and technical
documents on a large scale. As a result, the decision was taken to collect such texts
for our evaluation research, along with a smaller number of legislative and medical
documents, which also figured highly among survey responses. The resulting multi-
lingual parallel corpus is TECMATE (a TEchnical Corpus for MAchine Translation
Evaluation), comprising source texts, human and machine translations, and human
scores for fluency and adequacy for an increasing number of texts [15].
The decision to devise a classification scheme of fluency errors stemmed from the
need to identify error types in MT output to guide automated evaluation. Statistics
from the human annotation of MT output using such a scheme would provide infor-
mation on the frequency of error types in texts produced by different MT systems and
would help us select errors for automated detection. Statistics would also enable us to
compare error type frequency with human judgments for fluency and adequacy, ena-
bling us to focus on the detection of those error types whose frequency correlated
with lower human scores for one or both of those attributes.
Fine-grained error classification schemes are not practical for the black-box
evaluation of large numbers of machine translations; such a method is even more
time-consuming than, for instance, the evaluation of fluency or fidelity at segment
level. Consequently, few MT error classification schemes have been devised, and
most have been designed with a particular purpose in mind. The SAE J2450 Quality
Metric, developed by the Society of Automotive Engineers [16], and the Framework
for Standard Error Marking devised by the American Translators Association [17]
were both designed for the evaluation of human translations and are insufficiently
fine-grained for our purpose. Correa’s typology of errors commonly found in auto-
matic translation [18] was also unsuited to our needs, largely because it was designed
for glass-box evaluations during system development. Flanagan’s Error Classification
for MT Evaluation [19], designed to allow end-users to compare translations by competing
systems, Loffler-Laurian’s typology of MT errors based on linguistic problems for
post-editors [20], and the classifications by Roudaud et al. [21] and by Chaumier and
Green reported in [22] provide a more useful starting point for our work. However, these are still insuf-
ficiently fine-grained for our purpose, all rely on access to the source text, and most
are based on errors found in translations out of English. As our intention is to design
an automated error detection system that does not require access to the original or to
any human translation for comparison, it was essential to devise categories based on
the analysis of MT output in isolation.
Our classification of errors was progressively developed during the analysis and
manual annotation of approximately 20,000 words of MT output, translated from
French into English by four systems (Systran, Reverso Promt, Comprendium and
SDL’s online FreeTranslation2). The four machine translations of twelve texts (each
of approximately 400 words) from the TECMATE corpus were annotated with error
types. The texts comprised three extracts from software user manuals, three FAQs
(frequently asked questions) on software applications, three press releases on techni-
cal topics and three extracts from technical reports taken from the BAF corpus3. All
texts were chosen on the basis that they would be understandable to regular users of
computer applications.
Annotations were made according to items that a post-editor would need to amend
if he/she were revising the texts to publishable quality. Although the source text was
not made available, knowledge of the source language was necessary, as the scheme
requires untranslated words to be annotated with parts-of-speech. Furthermore, it was
important for the annotator to be familiar with the named entities and acronyms (eg.
names of software applications) in the texts, to better represent the end-user and code
these terms appropriately.
Errors were annotated using the Systemic Coder4, a tool that supports hierarchical
linguistic coding schemes and enables subsequent statistical analyses. Error types
were divided according to parts-of-speech, as this would provide more detailed in-
formation for analysis and would enable us to make more informed decisions when
selecting and weighting errors for our automated system. As the Coder supports the
insertion of new nodes into the hierarchy at any time, this facilitated the progressive
data-driven refinement of the coding scheme. For example, after annotating around
1,000 words, a decision was taken to sub-divide ‘inappropriate’ items (see Figure 1)
into ‘meaning clear’, ‘meaning unclear’ and ‘outrageous’ (words with an extremely
low probability of appearing in a particular text type and subject area). This refine-
ment would enable us to make better comparisons between MT systems, and isolate
those errors that have a greater effect on intelligibility.
During these initial stages of analysis, it became clear that, although we had set out to
annotate fluency errors, adequacy errors were also detectable as contributors to disfluency,
despite the absence of the source text. Words or phrases that were obviously incorrect
in the given context were marked as ‘meaning unclear’ and can be seen as both flu-
ency and adequacy errors. For this research, we can, therefore, define each annotated
error as a unit of language that surprises the reader because its usage does not seem
natural in the context in which it appears.
2 http://www.freetranslation.com/
3 http://www-rali.iro.umontreal.ca/arc-a2/BAF/Description.html
4 http://www.wagsoft.com/Coder/index.html
The current scheme contains all error types found in the French-English MT output.
However, the organization of categories reflects the constraints of the tool to a certain
extent. It was noticed during the annotation process that items often involved two and,
in rare instances, three error types. For example, a noun could be ‘inappropriate’, its
position within the phrase could be incorrect and it could lack a required capital letter,
or a verb could be ‘inappropriate’ and the tense also incorrect. The scheme was, there-
fore, organized in such a way that the tool would allow all of these combinations of
categories to be assigned to the same word or group of words where necessary.
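One way of modelling the requirement that a single item may carry several error codes at once is sketched below; the category labels are drawn from the discussion above, but the data structure itself is an illustrative assumption and not the Systemic Coder’s internal representation.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ErrorAnnotation:
    """A span of MT output annotated with one or more error codes."""
    span: Tuple[int, int]                        # token offsets within the segment
    codes: List[str] = field(default_factory=list)

# e.g. a noun that is inappropriate (meaning clear), wrongly positioned
# and lacking a required capital letter: three codes on one item
annotation = ErrorAnnotation(span=(4, 5),
                             codes=["noun/inappropriate/meaning_clear",
                                    "noun/position",
                                    "noun/case"])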
Statistics from our annotations were compared with human evaluation scores to ex-
plore correlations between the number of errors annotated and intuitive judgments on
fluency and adequacy. Each of the 48 machine translations was evaluated by three
different judges for each attribute. Texts were evaluated at segment level on a scale of
1-5, using metrics based on the DARPA evaluations [7]. For fluency, evaluators had
access only to the translation; for adequacy, judges compared candidate segments
with an aligned human reference translation. A mean score was calculated per seg-
ment for each attribute. These scores were then used to generate a mean score per text
and per system. Methods and results are described in [15].
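The aggregation of judge scores described above (segment means, then text and system means) can be sketched as follows; the record layout and field names are assumptions made for illustration.

from collections import defaultdict
from statistics import mean

def aggregate(scores):
    """scores: records like {"system": "A", "text": "t1", "segment": 3, "score": 4},
    one record per judge. Returns mean scores per segment, per text and per system."""
    by_segment = defaultdict(list)
    for r in scores:
        by_segment[(r["system"], r["text"], r["segment"])].append(r["score"])
    segment_means = {k: mean(v) for k, v in by_segment.items()}

    by_text, by_system = defaultdict(list), defaultdict(list)
    for (system, text, _), m in segment_means.items():
        by_text[(system, text)].append(m)
        by_system[system].append(m)
    return (segment_means,
            {k: mean(v) for k, v in by_text.items()},
            {k: mean(v) for k, v in by_system.items()})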
Assuming that all error types in the classification scheme affect fluency, we ini-
tially compared the total number of errors per system with human fluency scores. We
then removed all error categories that were considered unlikely to have an effect on
adequacy (such as ‘inappropriate’ items with a clear meaning, unnecessary items, in-
appropriate prepositions and determiners, omitted determiners, incorrect positions of
words, spelling errors, case errors and incorrect verb tense/mood or conjugation, the
majority of these being an inappropriate present tense in English). The remaining
classification of adequacy errors was then compared with the adequacy scores from
our human evaluations, as shown in Table 2.
Human fluency scores and the number of annotated fluency errors rank all four
systems in the same order. The picture is slightly different for adequacy, with Systran
and Reverso competing for the top position. We calculated Pearson’s correlation coef-
ficient r between the human scores and the number of errors per system for each at-
tribute. A very strong negative correlation was found between values: for fluency the
value of r = -0.998 and for adequacy r = -0.997. Of course, only four pairs of vari-
ables are taken into consideration here. Nevertheless, results show that we have man-
We computed error frequencies for the four text types. For each of the four systems,
user manuals were annotated with the largest number of errors, followed by FAQs,
technical reports and finally, press releases. The number of fluency errors and the sub-
set of adequacy errors were then compared with human scores for fluency and ade-
quacy according to text type. No significant correlation was found. In fact, human
scores for fluency were highest for user manuals for all systems, yet these texts con-
tained the largest number of annotated errors. It is clear, therefore, that errors must be
weighted to correlate with intuitive human judgements of translation quality. The two
main reasons for the large number of errors annotated in the user manuals were (i) the
high frequency of compound nouns (eg. computer interface items and names of soft-
ware applications), which, in many cases, were coded with two error types (eg. inap-
propriate translations and word order) and (ii) the high number of inappropriately
translated verbs, which although understandable in the majority of cases, were not
correct in the context of software applications (eg. leave instead of quit or exit,
register or record instead of save etc.). Furthermore, user manuals were annotated with the
largest number of untranslated words, yet many of these were understandable to
evaluators with no knowledge of French, having little or no adverse effect on ade-
quacy scores. A further experiment showed that 58% of all untranslated words in this
study were correctly guessed in context by three people with no knowledge of French.
In fact, 44% of these words, presented in the form of a list, were correctly guessed out
of context.
The eight most common error types (from a total of 58 main categories) were found to
be the same for all four systems, although the order of frequency differed between
systems and text types. The frequency of these eight errors represents on average 64%
of the total error count per system.
Table 3 shows that only in the case of inappropriate verbs (2) and inappropriate
prepositions (6) does the total number of errors correspond to the rank order of the
four systems according to human scores for fluency. The number of inappropriate
noun string content errors (7) corresponds to human rankings for adequacy. Further-
more, the frequency of very few error types in the entire scheme corresponds to hu-
man rankings of the four systems for either fluency or adequacy. It is also clear from
Table 3 that the frequency of particular errors within a given text type does not repre-
sent system performance as a whole.
Findings show that, while the frequencies of the above eight error types are signifi-
cant, detecting a small number of errors to predict scores for a particular text type or
system is not sufficient. Quality involves a whole range of factors – many of which
must be represented in our automated system. Furthermore, our intention is to build a
tool that will provide information on error types to help users and developers, rather
than merely a mechanism for producing a raw system score. It is clear, therefore, that
a number of different error categories should be selected for detection, based on their
combined frequencies, and on their computational tractability; we still need to deter-
mine which error types could be detected more successfully.
References
1. White, J.S.: How to evaluate machine translation. In Somers, H. (ed.): Computers and
translation: a translator’s guide. J. Benjamins, Amsterdam Philadelphia (2003) 211-244
2. FEMTI: A Framework for the Evaluation of Machine Translation in ISLE:
http://www.issco.unige.ch/projects/isle/femti/ (2004)
3. Papineni, K., Roukos, S., Ward, T., Zhu, W.: Bleu: a Method for Automatic Evaluation of
Machine Translation. IBM Research Report RC22176. IBM: Yorktown Heights, NY
(2001)
4. Akiba, Y., Imamura, K., Sumita, E.: Using multiple edit distances to automatically rank
machine translation output. In: Proceedings of MT Summit VIII, Santiago de Compostela,
Spain (2001)
5. Akiba, Y., Sumita, E., Nakaiwa, H., Yamamoto, S., Okuno, H.G.: Experimental Compari-
son of MT Evaluation Methods: RED vs. BLEU. In: Proceedings of MT Summit IX, New
Orleans, Louisiana (2003)
6. Takezawa, T., Sumita, E., Sugaya, F., Yamamoto, H., Yamamoto, S.: Toward a broad-
coverage bilingual corpus for speech translation of travel conversations in the real world.
In: Proceedings of the Third International Conference on Language Resources and
Evaluation (LREC), Las Palmas, Canary Islands, Spain (2002)
7. White, J., O’Connell, T., O’Mara, F.: The ARPA MT evaluation methodologies: evolution,
lessons, and future approaches. In: Proceedings of the 1994 Conference, Association for
Machine Translation in the Americas, Columbia, Maryland (1994)
8. Rajman, M., Hartley, A.: Automatically predicting MT systems rankings compatible with
Fluency, Adequacy or Informativeness scores. In: Proceedings of the Fourth ISLE Evalua-
tion Workshop, MT Summit VIII, Santiago de Compostela, Spain (2001)
9. Rajman, M., Hartley, A.: Automatic Ranking of MT Systems. In: Proceedings of the Third
International Conference on Language Resources and Evaluation (LREC), Las Palmas,
Canary Islands, Spain (2002)
10. Vanni, M., Miller, K.: Scaling the ISLE Framework: Validating Tests of Machine Transla-
tion Quality for Multi-Dimensional Measurement. In: Proceedings of the Fourth ISLE
Evaluation Workshop, MT Summit VIII, Santiago de Compostela, Spain (2001)
11. Vanni, M., Miller, K.: Scaling the ISLE Framework: Use of Existing Corpus Resources for
Validation of MT Evaluation Metrics across Languages. In: Proceedings of the Third In-
ternational Conference on Language Resources and Evaluation (LREC), Las Palmas, Ca-
nary Islands, Spain (2002)
12. White, J., Forner, M.: Predicting MT fidelity from noun-compound handling. In: Proceed-
ings of the Fourth ISLE Evaluation Workshop, MT Summit VIII, Santiago de Compostela,
Spain (2001)
13. Reeder, F., Miller, K., Doyon, K., White, J.: The Naming of Things and the Confusion of
Tongues. In: Proceedings of the Fourth ISLE Evaluation Workshop, MT Summit VIII,
Santiago de Compostela, Spain (2001)
14. Elliott, D., Hartley, A., Atwell, E.: Rationale for a multilingual corpus for machine trans-
lation evaluation. In: Proceedings of CL2003: International Conference on Corpus Lin-
guistics, Lancaster University, UK (2003)
15. Elliott, D., Atwell, E., Hartley, A.: Compiling and Using a Shareable Parallel Corpus for
Machine Translation Evaluation. In: Proceedings of the Workshop on The Amazing Utility
of Parallel and Comparable Corpora, Fourth International Conference on Language Re-
sources and Evaluation (LREC), Lisbon, Portugal (2004)
16. SAE J2450: Translation Quality Metric, Society of Automotive Engineers, Warrendale,
USA (2001)
17. American Translators Association, Framework for Standard Error Marking, ATA Ac-
creditation Program, http://www.atanet.org/bin/view/fpl/12438.html (2002)
18. Correa, N.: A Fine-grained Evaluation Framework for Machine Translation System Devel-
opment. In: Proceedings of MT Summit IX, New Orleans, Louisiana (2003)
19. Flanagan, M.: Error Classification for MT Evaluation. In: Technology Partnerships for
Crossing the Language Barrier, Proceedings of the First Conference of the Association for
Machine Translation in the Americas, Columbia, Maryland (1994)
20. Loffler-Laurian, A-M.: Typologie des erreurs. In: La Traduction Automatique. Presses
Universitaires Septentrion, Lille (1996)
21. Roudaud, B., Puerta, M.C., Gamrat, O.: A Procedure for the Evaluation and Improvement
of an MT System by the End-User. In: Arnold D., Humphreys R.L., Sadler L. (eds.): Spe-
cial Issue on Evaluation of MT Systems. Machine Translation vol. 8 (1993)
22. Van Slype, G.: Critical Methods for Evaluating the Quality of Machine Translation. Pre-
pared for the European Commission Directorate General Scientific and Technical Infor-
mation and Information Management. Report BR 19142. Bureau Marcel van Dijk (1979)
Online MT Services and Real Users’ Needs:
An Empirical Usability Evaluation
Federico Gaspari
1 Introduction
One of the most interesting areas in the current development of Machine Translation
(MT) is the presence on the Internet of a number of on-line services, some of which
are available free of charge1. MT technology has been available on the Internet for a
few years now and, since several language combinations are covered at the moment,
on-line MT services are becoming increasingly popular and are used on a daily basis
by a growing number of people in different ways for a variety of purposes. Due to
this interest, recent studies have looked at Internet-based MT technology from a range
of perspectives, emphasizing the challenges, potential and versatility of MT applica-
tions in the on-line environment ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10]).
Surprisingly, however, to date no attempt has been made to investigate the real us-
ers’ needs with a view to enhancing the performance of on-line MT services, e.g. by
promoting a more usable and user-oriented approach to their design. In order to look
1 The five on-line MT services considered in this study are Babelfish, Google Translate,
Freetranslation, Teletranslator and Lycos Translation. They are all available free of charge
on the Internet and Table 1 in the Appendix below provides their URLs (this information is
correct as of 14 May 2004).
at this under-researched area, the investigation presented in this paper has considered
the performance of five free on-line MT systems2. A small-scale evaluation based on
some key usability factors has been conducted to assess how successfully users can
take advantage of web-based MT technology. Whilst most ordinary people and Inter-
net users today tend to associate MT software with (free) web-based MT services like
the ones that have been considered in this study, not much research seems to be
geared towards making them more easily accessible and user-friendly. This paper
argues that this is a necessity that should be given priority to bridge the gap between
the MT industry and researchers on the one hand and end-users on the other.
Section 2 below explains what key web usability factors have been considered in
the evaluation and sheds some light on their relevance for on-line MT services. Sec-
tion 3 presents a few small-scale evaluation procedures that are based on the usability
criteria that have been previously introduced and comments on the results of the em-
pirical evaluation, laying particular emphasis on the real needs of people who use on-
line MT technology. Finally, Section 4 draws some conclusions and suggests possible
lines of practical action for future improvements to the design of on-line MT services
on the basis of real users’ needs.
Web usability has recently received attention as a discipline that offers valuable in-
sights into the subtleties of meeting the needs and expectations of users who interact
with on-line applications (see e.g. [11], [12], [13], [14]). In empirical terms a web-site
or an Internet-based service (such as any on-line MT system) that is highly usable is
easy to interact with and makes it possible for users to achieve their goals effectively
and efficiently.
The investigation presented in this paper evaluates the impact of some key web us-
ability criteria on the design of web-based MT systems from the point of view of
users, and discusses the extent to which they affect successful interaction and as a
result the level of user satisfaction. The assumption behind this approach is that im-
plementing a usability-oriented approach to the design of web-based MT services has
a crucial importance for the widespread acceptance and success of on-line MT tech-
nology among users.
Guessability and learnability are two basic notions that fall under the umbrella of web
usability and will be explained in more detail in this section. Guessability refers to the
effort required on the part of the user to successfully perform and conclude an on-line
2 All the information presented here regarding the evaluation procedures and the tests referred
to later in the paper is based on experiments carried out by the author on the Internet and is
correct as of 14 May 2004.
task for the first time, and directly depends on how intuitive and predictable its design
is – both in terms of the graphic user interface and of the interaction procedures in-
volved ([15]:11-12).
Learnability, on the other hand, involves the effort and time required on the part of
the users to familiarize themselves with the satisfactory operation of a web-based
application after they have used it already at least once ([15]:12-13). Learnability,
then, is directly linked with how intuitive and memorable completing an on-line task
is for users.
Guessability has a crucial importance the first time an Internet user tries to access
an on-line MT service, and as a result is of particular interest for novice users who
have never been exposed to a particular web-based MT system before. Learnability,
on the other hand, has a longer-term relevance since it refers to how easily users of
any level of experience can memorise and retain the procedures that are necessary to
interact with the on-line MT technology, and accordingly follow them correctly in the
future – this applies in particular to returning users. The two interrelated notions of
learnability and guessability presented here also relate to the likelihood that users will
make mistakes while operating an on-line application. As a result, they lay emphasis on
the need to provide straightforward ways in which users can recover from errors
made during on-line interaction.
Different user profiles obviously require different levels of support and guidance
while operating MT systems, as has been argued in [16]:227 with particular reference
to the appropriateness of off-line MT software documentation for various types of
users. It seems equally desirable that on-line MT services provide support that is
useful for users with a range of skills and levels of expertise, from novice to highly
experienced.
3 This function of on-line MT services is of interest here, i.e. when users supply the URL of a
whole web-page they would like to translate. The other option usually available in web-
based MT systems is the translation of plain text that can be copied and pasted or typed into
an appropriate box, but this latter mode of use will not be taken into consideration here. Each
of the five free on-line MT services considered in this study offers both these options.
Web-surfers who have only a very poor knowledge of a foreign language may still
prefer to read MT output in their own mother tongue (provided this is available),
checking the original source document from time to time for a more thorough under-
standing when necessary. In such cases, Internet users with limited linguistic skills in
more than one language may use the machine-translated document as a partial aid to
grasp the general information on a web-page, with the option to refer back to the
original text, whose contents they may have some ability to read and understand.
Under such circumstances users can compensate for inadequacies in the performance
of the system, e.g. where the output is of particularly poor quality; for example,
specialized jargon that is not in the MT system’s lexicon may still be understood by
the user in the foreign language.
At this stage of the discussion, following [17] a useful distinction should be made
between on-demand and on-the-fly machine translation services that operate on-line.
On-demand MT “occurs when a user initiates some action to cause a page to be
translated. The user interface to receive the user’s translation request can be, but is
not limited to, a button or a hyperlink” ([17]:3). The so-called on-the-fly machine
translation, on the other hand, takes place “automatically, meaning there is no explicit
user interaction, using the user’s browser’s language preference or the language pref-
erences defined by the translation system’s administrator” (ibid.).
On a similar level, some on-line MT services perform “one-time translation”,
which causes “the page itself to be translated, but any other pages accessed from this
page (such as through hyperlinks) will not be translated but displayed in the language
in which they were authored” ([17]:4). On the other hand, web-based MT systems
performing “continuous translation” (in other words, an implicit on-line MT of hy-
perlinks) offer the possibility for “the page itself to be translated as well as all subse-
quent pages through link modification. [...] With all of the links modified in this
manner, continuous translation can be achieved” (ibid.).
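The link-modification idea behind continuous translation can be sketched as follows; the gateway URL pattern and its parameters are invented for illustration, since each real service uses its own scheme.

import re
from urllib.parse import quote

GATEWAY = "http://mt.example.com/translate?lp={lp}&url={url}"    # hypothetical gateway

def rewrite_links(html, lang_pair="en_es"):
    """Rewrite every href so that followed links are routed back through the
    (hypothetical) translation gateway, yielding continuous translation."""
    def repl(match):
        target = match.group(1)
        return 'href="' + GATEWAY.format(lp=lang_pair, url=quote(target, safe="")) + '"'
    return re.sub(r'href="([^"]+)"', repl, html)

# usage: the translated page is served with all of its links pointing at the gateway
page = '<a href="http://example.org/news.html">News</a>'
print(rewrite_links(page))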
Since the reading process in the on-line environment is largely based on hypertex-
tuality, it is essentially non-linear, as opposed to what usually happens with most
traditional printed, paper-based texts. The focus of this research is on texts in digital
form contained in web-pages (irrespective of their length), since they are typically fed
into on-line MT systems to serve as input.
Internet users regularly take advantage of links to navigate the Web, so as to move
around from one page to another of the same Internet site, or indeed to visit some
external web-sites. As a result, while users read on-line texts, they typically follow
non-linear threads with very unsystematic and fragmented browsing patterns that
cannot be pre-determined or predicted in any way.
Web-surfers who request on-line MT for a web-page from language A into lan-
guage B almost certainly prefer to keep on reading MT output in language B, if they
click on a hypertext anchor that points them to another web-page whose contents are
written in language A. Allowing users to do so is bound to greatly enhance the quality
of their browsing experience, but on-line MT services do not always allow continuous
translation of linked web-pages, thus somehow contradicting the principle that the
reading process mostly follows a non-linear pattern on the Internet. The difference
between having access to continuous translation as opposed to one-time translation
seems to have a crucial impact on the degree of usability that users experience during
their interaction with on-line MT services.
The previous section has focused on web usability and in particular on a limited set of
key factors (e.g. guessability and learnability) that will be considered here in more
detail. This section will in fact try to establish some direct links with on-line MT
services, emphasizing the impact of web usability issues on typical users’ needs in
terms of interaction design. Against this background, the purpose of this section is to
examine and evaluate the relevance to web-based MT systems of the web usability
factors identified above. A number of small-scale evaluation procedures are proposed
mainly on a comparative basis for the benefit of clarity.
The approach to this evaluation of on-line MT services does not aim to establish
which are the best solutions or strategies among those reviewed. In other words, there
is no explicit or implicit intention to suggest that a particular on-line MT service is
ultimately better than another because it seems to offer a more satisfactory perform-
ance according to the specific usability criteria that will be looked at. Rather, here it
will be sufficient to briefly discuss some crucial features that have a bearing on the
use of on-line MT technology, without attempting a real final classification of pre-
ferred strategies. This loose approach to evaluation is partially due to the fact that
there are no set standards or benchmarks available yet to formally evaluate the level
of usability of on-line MT services. As a result, this innovative approach to the
evaluation of the users’ needs of on-line MT services will inevitably need to be fur-
ther refined and improved in the future.
It is however interesting to note that [18], a general textbook on Machine Transla-
tion, does briefly mention usability when covering the evaluation of MT and com-
puter-assisted translation (CAT) products on the basis of the attention that this con-
cept received in [19], a highly influential report concerning the standards for the
evaluation of NLP systems. Even though [18]:262 offers a brief illustration of the
main implications of usability in general for PC-based MT and CAT software, it does
not cover any aspect of web usability as such that may be related to the design of and
interaction with on-line MT services. Similarly, in the context of a discussion devoted
to the possibility of automating evaluation procedures for MT, [20]:107 refers to the
concept of usability in connection with MT, but again this term is applied to off-line
MT software packages.
As a result, care should be taken not to confuse the general concept of usability re-
ferred to in [18] and [20], which refers to software products in general, with the more
specific one of web usability, as explained in Section 2 above, which is of particular
interest in the present study focusing on Internet-based MT services.
The guessability and learnability factors are best measured on a comparative basis
in empirical web usability inspections, since it is very hard to find ef-
fective ways to evaluate them in absolute terms. As a result, the degree of guessability
and learnability would be typically measured by testing and comparing the perform-
ance of two or more similar on-line applications on equivalent tasks.
Due to reasons of space, here it seems sufficient to provide a practical appraisal of
these factors by briefly showing how they can be greatly enhanced by looking in
particular at some features of one free on-line MT service (i.e. Babelfish, see [1], [7]).
For the sake of clarity and in order to avoid unwieldy reports that would exceed the
purpose of this discussion, Babelfish will be considered here for the purpose of illus-
trating a simple but successful strategy aimed at increasing the degree of guessability
and learnability for the benefit of all its users, and novice or inexperienced users in
particular.
The on-line submission forms used by the five popular free web-based MT serv-
ices considered here seem to be equally intuitive, in that they all request the same
type of information from the users by means of similar procedures. It is worth point-
ing out, though, that Babelfish offers a clever guessability and learnability advantage compared with the other free on-line systems, which is crucial for enhancing the overall degree of usability: Babelfish shows a series of succinct user tips in the on-line MT request form, just under the field where URLs should be entered.
These practical suggestions are displayed on a rotating basis and aim at familiarising users with procedures or simple tricks that enhance the performance of the MT service, thus maximising its usability for net-surfers. When a user visits the Babelfish web-site, the tips appear in rotation under one of the two drop-down menus used to select the language combination for the translation job (Figures 1, 2 and 3 in the Appendix show three of these tips).
This strategy gives users effective practical advice to improve their interaction
with the on-line MT system without requiring too much effort and background
knowledge, since the tips are very short, clearly written in plain non-technical lan-
guage and explicitly goal-oriented. Of the five free on-line MT systems reviewed in
this research, Babelfish is the only one that presents this approach to the enhancement
of guessability and learnability, thus presumably lending itself to more successful
operation, especially by novice, “one-off” and casual users (cf. [16]:223, 225).
This part of the discussion tries to assess the practical implications that the parallel
browsing of a machine-translated web-page and its original source document can
have from the point of view of on-line MT users. It could well be the case, for instance, that an Internet user who visits a web-site or web-page available only in English has merely a superficial knowledge of that language.
Despite not being very confident in English, such users may nevertheless know or understand words that appear in the original web-pages they are visiting, and recognise phrases or specialised terms that the on-line MT service fails to translate correctly but that are familiar to them, for example because they belong to a specific field of interest of the user (e.g. leisure information regarding a hobby, or professional content). In such cases users may want to read a machine-translated version in another language, but it may sometimes be helpful for them to go back to the original, so as to check unclear key-words (e.g. those appearing in hyperlinks, when deciding whether or not to click on the hypertext anchor) or specific parts of the text on the web-page.
An in-built facility to make such a “parallel” process possible is available only in
the environment of Babelfish, Google Translate and Teletranslator, whereas the other
web-based MT services that have been reviewed for this evaluation (i.e. Freetransla-
tion and Lycos Translation) do not offer it. Figures 4, 5 and 6 in the Appendix show
details of the screenshots of Babelfish, Google Translate and Teletranslator, respec-
tively, i.e. a top frame that is automatically added by these on-line MT services into
the browser window to inform users that they can click on a link/button to view the
web-page in the original language, while the bottom frame in the rest of the browser
window (not included in the figures) shows the translated version.
If users click on the appropriate link/button of the top frame, a new small browser
window pops up on the screen, showing the corresponding original source document.
As a result, the reader may refer to the original web-page in the source language,
whilst the machine-translated web-page is still displayed in the background. In this
way the Internet user can easily switch back and forth between the two parallel ver-
sions, if they want to do so.
Browsing an original web-page in parallel with its machine-translated version in the target language selected by the user may be an effective strategy for disambiguation purposes, since on-line MT engines are usually designed to cope with general-purpose language and standard vocabulary. In spite of having only a limited and superficial multilingual knowledge, however, Internet users may be able to understand, in a variety of languages, technical terms, acronyms and jargon specific to their own field of interest.
Where the MT system's performance is likely to be poor, or the output in the target language is unclear or full of blatant errors, users may still in some cases be able to grasp the gist of on-line texts by also looking at the original language. The link/button in the top frame shown in Figures 4, 5 and 6 in the Appendix offers a straightforward way to navigate original web-pages and their machine-translated versions in parallel, if users prefer to do so. This neat feature, which may greatly benefit multilingual Internet users who find MT useful as an aid to understanding the contents of web-pages but are also able to compensate for its shortcomings, is a clever implementation of usability.
The previous section has attempted to present some simple procedures to evaluate the
degree of usability of five of the most popular free on-line MT services, according to
the key factors that had been previously identified in Section 2, discussing the main
implications of the results. The list of criteria that have been covered is by no means
complete, and the evaluation tests were mainly based on superficial observations and
brief comparisons.
In spite of this rather informal methodological approach, the previous section provides evidence that usability criteria affect the overall quality of on-line MT services, as well as the extent to which users can successfully interact with them when translating entire web-pages.
4.1 Conclusions
In summary, two main general conclusions can be drawn on the basis of the evalua-
tion presented here. First, different on-line MT services adopt different approaches to individual usability criteria, and this has a number of implications from the point of view of users' needs. Indeed, the fragmented picture given above shows that no clear preferences have yet emerged regarding the design of user interaction for web-based MT systems.
It should be noted, however, that some practices seem to be fairly standardised, as is for instance the case for the steps that users are expected to follow in order to access and start the on-line translation process. At the same time, in a number of other areas that play a crucial role in the user's interaction with the web-based application, diverging strategies are adopted by different systems.
Secondly, no general consensus or established rules exist at present as to the best strategies for successfully offering MT technology in the on-line environment, with special reference to the translation of entire web-pages. This situation emphasises the need for the rapid development of specific and reliable evaluation criteria, benchmarks and standards that would encourage user-centred approaches to the design of on-line MT applications in the short term.
Along the lines of the approach adopted for this research on a very small scale, em-
pirical data derived from formal usability inspections and a range of more in-depth
evaluation tests would provide a solid platform to pursue the enhancement of on-line
MT services, focusing on a larger set of key factors that are crucial for users.
As far as the preliminary results presented in this research are concerned, the in-
vestigation of a few key usability factors has shown that simple principles and design
guidelines greatly enhance the overall performance of on-line MT services. Web
usability is quickly reaching a state of maturity in other widespread Internet-based
applications, especially those connected with e-commerce web-sites and remotely-
managed financial services and business transactions. In these areas user-centred
design is at present widely advocated as a major concern.
Web-based MT services, on the other hand, do not currently seem to be adequately
designed to meet the specific needs of users in the Internet environment. Web-surfers
have demands and priorities that need to be catered for by on-line MT technology, if
this is to be successful as a useful resource to navigate the World Wide Web. Looking
at how the sample of free on-line MT services considered here are designed today, it
is easy to notice that many improvements at various levels are still needed.
This study suggests in conclusion that prompt action should be taken in order to
apply usability principles to the design and development of web-based MT services.
The previous discussion has identified some important areas that are crucial to the users of Internet-based MT technology, and some of the indications provided by this
study may be fed back into the research and development of on-line MT services to
enhance their design, thus ensuring that they meet the real needs of their growing
population of users.
References
1. Yang, J., Lange, E.: SYSTRAN on AltaVista: A User Study on Real-Time Machine
Translation on the Internet. In: Farwell, D., Gerber, L., Hovy, E. (eds.): Machine Transla-
tion and the Information Soup. Lecture Notes in Artificial Intelligence, Vol. 1529.
Springer-Verlag, Berlin Heidelberg New York (1998) 275-285
2. Hutchins, J.: The Development and Use of Machine Translation Systems and Computer-
based Translation Tools. In: Zhaoxiong, C. (ed.): Proceedings of the International Confer-
ence on Machine Translation and Computer Language Information Processing, 26-28 June
1999, Beijing, China (1999) 1-16 [Available on-line at the URL
http://ourworld.compuserve.com/homepages/WJHutchins/Beijing.htm - Accessed 14 May
2004]
3. Miyazawa, S., Yokoyama, S., Matsudaira, M., Kumano, A., Kodama, S., Kashioka, H.,
Shirokizawa, Y., Nakajima, Y.: Study on Evaluation of WWW MT Systems. In: Proceed-
ings of the Machine Translation Summit VII. MT in the Great Translation Era, 13-17 Sep-
tember 1999, Singapore (1999) 290-298
4. Macklovitch, E.: Recent Trends in Translation Technology. In: Proceedings of the
International Conference The Translation Industry Today. Multilingual Documentation,
Technology, Market, 26-28 October 2001, Bologna, Italy (2001) 23-47
5. O’Connell, T.: Preparing Your Web Site for Machine Translation. How to Avoid Losing
(or Gaining) Something in the Translation (2001) [Available on-line at the URL
http://www-106.ibm.com/developerworks/library/us-mt/?dwzone=usability - Accessed 14
May 2004]
6. Zervaki, T.: Online Free Translation Services. In: Proceedings of the Twenty-fourth Inter-
national Conference on Translating and the Computer, 21-22 November 2002, London.
Aslib/IMI, London (2002)
7. Yang, J., Lange, E.: Going Live on the Internet. In: Somers, H. (ed.): Computers and
Translation. A Translator’s Guide. John Benjamins, Amsterdam Philadelphia (2003) 191-
210
8. Gaspari, F.: Enhancing Free On-line Machine Translation Services. In: Lee, M. (ed.):
Proceedings of the Annual CLUK Research Colloquium, 6-7 January 2004, University
of Birmingham (2004) 68-74
9. Gaspari, F.: Integrating On-line MT Services into Monolingual Web-sites for Dissemina-
tion Purposes: an Evaluation Perspective. In: Proceedings of the Ninth EAMT Workshop
Broadening Horizons of Machine Translation and Its Applications, 26-27 April 2004,
Foundation for International Studies, University of Malta, Valletta, Malta (2004) 62-72
10. Gaspari, F.: Controlled Language, Web Usability and Machine Translation Services on the
Internet. In: Blekhman, M. (ed.): International Journal of Translation. A Half-Yearly Re-
view of Translation Studies. Special Number on Machine Translation, Vol. 16, No. 1, Jan-
June 2004. Bahri Publications, New Delhi (2004) 41-54
11. Nielsen, J.: Designing Web Usability. The Practice of Simplicity. New Riders Publishing,
Indiana (2000)
12. Krug, S.: Don’t Make Me Think. A Common Sense Approach to Web Usability. New
Riders Publishing, Indiana (2000)
13. Nielsen, J., Tahir, M.: Homepage Usability. 50 Websites Deconstructed. New Riders Pub-
lishing, Indiana (2002)
14. Brinck, T., Gergle, D., Wood, S.D.: Designing Web Sites That Work. Usability for the
Web. Morgan Kaufmann Publishers, San Francisco (2002)
15. Jordan, P.W.: An Introduction to Usability. Taylor & Francis, London (1998)
16. Mowatt, D., Somers, H.: Is MT Software Documentation Appropriate for MT Users? In:
White, J.S. (ed.): Envisioning Machine Translation in the Information Age. Lecture Notes
in Artificial Intelligence, Vol. 1934. Springer-Verlag, Berlin Heidelberg New York (2000)
223-238
17. Sielken, R.: Enabling a Web Site for Machine Translation Using WebSphere Translation
Server (2001) [Available on-line at the URL
http://www7b.boulder.ibm.com/wsdd/library/techarticles/0107_sielken/0107sielken.html
Accessed 14 May 2004]
18. Trujillo, A.: Translation Engines. Techniques for Machine Translation. Springer-Verlag,
Berlin Heidelberg New York (1999)
19. Expert Advisory Group on Language Engineering Standards (EAGLES): Evaluation of
Natural Language Processing Systems Final Report. Technical Report EAG-EWG-PR.2
(Version of October 1996). ISSCO University of Geneva (1996) [Available on-line at the
URL http://www.issco.unige.ch/projects/ewg96/ewg96.html - Accessed 14 May 2004]
20. White, J.S.: Contemplating Automatic MT Evaluation. In: White, J.S. (ed.): Envisioning
Machine Translation in the Information Age. Lecture Notes in Artificial Intelligence, Vol.
1934. Springer-Verlag, Berlin Heidelberg New York (2000) 100-108
Appendix 4
4 Disclaimer: the information presented in Table 1 is correct as of 14 May 2004. The screen-
shots and the figures presented in the Appendix have been downloaded from the Internet by
the author of this paper on 14 May 2004.
5 It should be noted that this user tip refers to the on-line machine translation of passages of
plain text, since otherwise it is impossible for users to add extra characters or symbols to the
text contained in a web-page, in order to keep a word or proper name in the source document
untranslated (e.g. “Pink Floyd”). In spite of not being directly applicable to the on-line
translation of entire web-pages, this tip is shown here to provide a more general idea of the
information supplied to users by Babelfish.
Counting, Measuring, Ordering: Translation Problems
and Solutions
Abstract. This paper describes some difficulties associated with the translation
of numbers (scalars) used for counting, measuring, or selecting items or
properties. A set of problematic issues is described, and the presence of these
difficulties is quantified by examining a set of texts and translations. An
approach to a solution is suggested.
1 Introduction
Translation of numbers would seem to be a particularly easy task. After all, 15 is 15,
and even if fifteen is not quinze there is a one-to-one correspondence between them.
However, as Tom Lehrer would be quick to point out, 15 in base 8 is not the same as 15 in base 10. We are lucky that the decimal system is widespread and that representation in other bases is usually explicitly noted.
In this paper, we focus on some difficult cases, where numbers may not be all that
they seem, particularly as far as translation goes. We focus on the use of numbers in
text, to count, to measure, to order or select. In previous work, we have discussed in
detail a particular case of selection (Farwell and Helmreich, 1997, 1999). Two
translators decided differently about which floor-naming convention to use in
translating into English a Spanish newspaper article that mentioned the “segundo
piso” and the “tercero piso”. One translated these phrases as “second floor” and
“third floor” while the other translated them as “third floor” and “fourth floor.” We
argued that both translations were correct, and justified on the basis of default
reasoning about the convention likely to be used by the author of the article and the
convention likely to be inferred by the audience of the translation.
Section 2 defines the scope of the issue highlighted in that article. Section 3
presents an incomplete list of specific problematic issues that fall within this scope. In
Section 4, we examine a set of texts and translations to try to determine the
quantitative extent of the issue. We suggest some approaches to the solution in
Section 5.
There appear to be three human activities that cover the scope of this problem:
counting, measuring, and ordering or selecting. The activities attempt to answer
questions such as “How many?” (counting); “How much?” (measuring) and
“Which?” (selecting). In each case there is a domain which consists of some thing
which is counted, measured, or selected. For counting and selecting, it is usually a set
of objects, for measuring, a property. There is a unit of measurement. For counting
and selecting, it is usually an individual in the set. For measuring, the unit of
measurement is usually specific to the property. There is a set of counters, which may
be a finite set, a countably infinite set, or an uncountably infinite set. For counting and
selecting, the standard and productive set is the set of positive integers, or the natural
numbers. For measuring, it is real numbers (though in practice the rationals are
usually sufficient). For each counter, there is a set of linguistic representations or
names. Usually there is more than one linguistic representation for each counter.
There is also an ordering on the set of counters, which allows for specifying a range
or specific subset of the things counted. Finally, there is a starting point for the
ordering. The order and starting point may be specified by convention, as in the
differing floor-naming conventions, or derived from the context, as in (1)
(1) The people in the picture are listed counterclockwise, starting in the upper left.
For the purposes of this paper, let us call a set consisting of these six elements
{domain, unit of measurement, {counters}, {{linguistic representations}}, ordering,
starting point} a scalar system. This information is summarized in Table 1.
For the purposes of this paper, we shall regard scalar systems that differ in domain,
unit of measurement, counting set, order, or starting point as different scalar systems.
Systems that differ only in the names assigned to the set of counters are not regarded
as different scalar systems. Scalar systems are thus not restricted to a particular
language.
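For concreteness, the six-element definition can be rendered as a small data structure; the following is a minimal sketch, with field names and example values chosen purely for illustration rather than taken from Table 1.

```python
from dataclasses import dataclass

@dataclass
class ScalarSystem:
    domain: str          # the thing counted, measured, or selected
    unit: str            # unit of measurement (or the individual, for counting)
    counters: str        # e.g. "positive integers", "real numbers"
    names: dict          # counter -> its linguistic representations
    ordering: str        # how the counters are ordered
    starting_point: int  # where the ordering begins

# Fahrenheit and Celsius differ in unit size and starting point;
# the two floor-naming conventions differ only in starting point.
fahrenheit = ScalarSystem(
    domain="ambient temperature", unit="degree Fahrenheit",
    counters="real numbers", names={32: ["thirty-two", "32"]},
    ordering="numerical, ascending", starting_point=0)

us_floors = ScalarSystem(
    domain="floors of a building", unit="one floor",
    counters="positive integers",
    names={1: ["first floor", "ground floor"], 2: ["second floor"]},
    ordering="bottom-up", starting_point=1)   # the European convention starts at 0

print(fahrenheit.domain, us_floors.starting_point)
```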
Translation difficulties may arise if there is more than one scalar system in use,
particularly if the correlation between the two systems is complex. Problematic cases
also arise within a single scalar system if the sets of names are not disjoint, that is, if
the same name is connected to more than one counter. We list a number of such difficulties and problematic cases; the list is by no means complete. It is left to the reader to
apportion the difficulty to one or more of the elements of the scalar systems.
Calendar systems. Several calendar systems are in use (e.g. the Gregorian, Julian, Jewish, and Arabic calendars), with differing notation for days, months, and years. Since two of these calendars are lunar and two are solar, there is no simple equivalence between them.
Even within the same calendar system, there may be complications due to the
names associated with particular dates. For instance, standard German notation is
day/month/year, while standard American notation is month/day/year.
Time measurement. There are two standard systems in use: the 12-hour (am/pm) clock and the 24-hour (military) clock.
Measures of space: length, area, volume, and weight. There are two standard
systems in use: metric and English. They differ in units of measurement.
Measures of temperature. There are three systems in use: Celsius, Fahrenheit,
and Kelvin. There is a difference in the size of the measuring unit (degrees) between
Celsius and Kelvin on the one hand, and Fahrenheit on the other. The difference
between Celsius and Kelvin is a difference in starting point.
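The contrast can be made concrete with the standard conversion formulas; in the sketch below, the Kelvin conversion changes only the zero point, while the Fahrenheit conversion changes both the degree size and the zero point.

```python
def celsius_to_kelvin(c: float) -> float:
    # same degree size, shifted zero point
    return c + 273.15

def celsius_to_fahrenheit(c: float) -> float:
    # different degree size (9/5) and different zero point (32)
    return c * 9 / 5 + 32

print(celsius_to_kelvin(25.0))      # 298.15
print(celsius_to_fahrenheit(25.0))  # 77.0
```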
Floor-naming conventions. Which floor counts as “the first floor” determines the names of the other floors. There are also some slight variations, in that many US buildings do not have a 13th floor, while Japanese buildings often do not have a 4th floor. Thus the floor name (the number shown in the elevator) and the actual floor number may vary as well.
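A minimal sketch of the renumbering underlying the “segundo piso”/“third floor” example from the Introduction; the function name and starting-point parameters are illustrative assumptions.

```python
def rename_floor(floor_number: int, source_start: int, target_start: int) -> int:
    # Convert a floor name between conventions that differ only in where
    # counting starts (e.g. ground floor = 0 vs ground floor = 1).
    physical_level = floor_number - source_start
    return physical_level + target_start

# "segundo piso" under a ground-floor-is-0 convention -> US "third floor"
print(rename_floor(2, source_start=0, target_start=1))  # 3
# under a ground-floor-is-1 convention it is simply the "second floor"
print(rename_floor(2, source_start=1, target_start=1))  # 2
```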
Alphabetization conventions. Each language has different conventions for
ordering the letters of its alphabet. Thus in Spanish, vowels accented with an acute
accent are alphabetized with the vowel itself, while ñ and double ll are given separate
status.
Musical works. Musical works by a single composer are often enumerated by opus number, usually assigned by the composer. However, the works of a number of composers have been catalogued by scholars and assigned numerical tags, again usually in chronological order. As new works are discovered or works are reattributed to other composers, different scholars may propose different enumerations. So there is the S listing and the BWV listing for Bach’s works, the K and the L listings for Scarlatti’s works, etc.
Book cataloguing schemas. There are two systems in common use: Dewey
Decimal and Library of Congress. It is interesting to note that the Dewey Decimal
system has a non-standard ordering. A book catalogued as 251.35 is filed after a
book catalogued as 251.4.
In addition to the difficulties caused by the scalar systems themselves, further translation difficulties are caused by two common linguistic strategies: (a) ellipsis of repeated material; and (b) omission of shared mutual knowledge. These strategies make it difficult to identify exactly which scalar system is in use. Examples (2) and (3) illustrate strategy (a); examples (4) through (6) exemplify strategy (b).
(2) John will be four years old on Saturday, and his brother will be two on Tuesday.
(3) That long board measures 6 feet long, while the smaller one is only 4.
(4) Whew, it’s hot out there. It must be at least 80 degrees in the shade.
(5) In 1960, Kennedy was elected president.
(6) In Ottawa, that would cost you $10.
A final translation difficulty arises in selecting an appropriate scalar system for the
target audience. In general, there are four translation strategies. Which one is used
will depend on knowledge of and inferences about the scalar systems in use by the
target audience, decisions about the purpose of the translation, and possibly other
pragmatic factors.
(A) A literal translation of the source language scalar system would convey to the
reader a more exact understanding of the original text. It may also serve the purpose
of highlighting the foreign nature of the original text or setting, as in example (7):
(7) “I’ll give you a thousand ducats for that horse!” cried the Sultan.
(B) A translation into the target language scalar system will communicate more
directly the intended content of the source language text, as in (9) as opposed to (8).
(8) The orchestra played Bach’s “Messe in H-moll”.
(9) The orchestra played Bach’s “Mass in b minor”.
(C) A translation strategy may provide both a literal translation of the original scale
and a translation into the target language scalar system, as in (10):
(10) The shipment cost 43,000 yuans (4,940 dollars).
(D) A final translation strategy tries to overlook the exact scale and provides a general
equivalent, if the exact amount or number is not vital to the understanding of the text,
as in (11) through (13). Here the original text “decenas” means literally “tens”.
(11) (Original) decenas de empresas occidentales
(12) (Translation 1) dozens of Western businesses
(13) (Translation 2) numerous Western businesses
4 Empirical Evidence
Given the multiple scalar systems in use as described in Section 3, the common occurrence of linguistic strategies such as ellipsis and omission, and the choice among four translation strategies, it would seem that translation of scalar system references would be difficult indeed. On the other hand, it is possible that the multiple systems listed above are used only infrequently, and that in the vast majority of instances a simple word-for-word, number-for-number translation is effective and correct.
We examined 100 Spanish newspaper texts, each with two professional English translations (provided for the 1994 DARPA MT evaluation; White et al., 1994).
We isolated all the scalar phrases in the texts. These were any phrase that contained a
number, whether spelled out or written in numerals, as well as any phrase that
specified a member or set of members of a scalar system by reference to that system.
That would include phrases like “on Wednesday”, “last week,” or “the majority of the
commission,” “both men.” We did not include phrases that identified a specific
individual through the use of definite or indefinite articles, nor did we include
quantified phrases in general. We did more extensive work on the first 15 texts.
The first 15 texts contained a total of 5759 words. We isolated from these texts 222
scalar phrases, containing 796 words, or 13.8% of the total. This is the percentage of
text that could, a priori, exhibit translation difficulties due to scalar systems. Of these
222 scalar phrases, 115 exhibited a difference between the two English translations, or about 51.8%. These 115 phrases accounted for 388 words, or about 48% of the total.
(This last figure may be somewhat misleading, as the 796 words of the scalar phrases
were counted in the Spanish original text, while the 388 words of difference were
counted from one of the English translations.)
Of the 222 scalar phrases in the original, we identified 44, or 19.8% of the total, where the translation was not a simple word-for-word substitution. Fifteen of the 44 involved a simple substitution of period for comma or vice versa, as in (14) through (17).
(14) Original: 31,47%
(15) Translation: 31.47%
(16) Original: unos 300.000 empleados
(17) Translation: some 300,000 employees
Twenty-one cases also involved the period/comma alternation, but in a more complex fashion. Apparently, in Spanish it is standard to speak of thousands of millions of dollars, whereas the English standard is to speak of billions of dollars. Thus in these cases the period was not transposed into a comma; instead, it was kept and the word “millones” was translated as “billion”, as in (18) and (19).
(18) Original: 2.759 millones de dólares
(19) Translation: 2.759 billion dollars
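The two regularities just described lend themselves to simple mechanical treatment; the sketch below is illustrative only (the helper names are invented, and real coverage would need far more patterns), not a description of how the translators or any MT system actually proceed.

```python
import re

def swap_separators(number: str) -> str:
    # Spanish "31,47" -> English "31.47"; Spanish "300.000" -> English "300,000"
    return number.translate(str.maketrans(",.", ".,"))

def millones_to_billion(phrase: str) -> str:
    # "2.759 millones de dólares" is 2,759 million dollars, i.e. 2.759 billion
    # in English, so the period is kept and "millones" becomes "billion".
    match = re.match(r"([\d.]+) millones de dólares$", phrase)
    return f"{match.group(1)} billion dollars" if match else phrase

print(swap_separators("31,47") + "%")                    # 31.47%
print("some " + swap_separators("300.000") + " employees")
print(millones_to_billion("2.759 millones de dólares"))  # 2.759 billion dollars
```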
Six additional cases involved scalar phrases using the metric system. In all six, both
translators opted for translation strategy (A) – leaving the translation in the metric
system as in (20) through (22).
(20) Original: a unos 30 km
(21) Translation 1: some 30 km. away
(22) Translation 2: about 30 km away
Two final cases involved the floor-naming convention discussed in Farwell and Helmreich (1997).
Generalizing, we estimate that about 13.8% of text involves scalar phrases and thus
potential translation difficulties, and that 19.8% of those (or about 2.73% of total text)
actually does exhibit these difficulties. Furthermore, the languages and cultures
involved (Spanish and English) are not too different. One would expect a greater
number of difficulties in translation of scalar phrases the more distant the source and
target languages and cultures are.
Good translators, of course, can ensure that these translation difficulties do not cause translation problems. However, in six cases even these professional translators produced semantically incompatible translations. Two of the cases involved the floor-
naming convention, as shown in (23) through (28). This appears to be the result of
different decisions about which floor-naming conventions were in use.
(23) Original: tercero piso
(24) Translation 1: third floor
(25) Translation 2: fourth floor
(26) Original: segundo piso
(27) Translation 2: third floor
(28) Translation 1: second floor
One case appeared to involve a choice by one translator to use translation strategy
(D). This was shown in examples (11)-(13), repeated here as (29) through (31):
(29) (Original) decenas de empresas occidentales
(30) (Translation 1) dozens of Western businesses
(31) (Translation 2) numerous Western businesses
Two more cases appeared to be the result of interference from the scalar systems involved, resulting in errors, as shown in (32) through (37).
(32) Original: 16 millones de dólares
(33) Translation 1: 16 billion dollars
(34) Translation 2: 16 million dollars
(35) Original: 2.069 millones de dólares
(36) Translation 1: 2.069 billion dollars
(37) Translation 2: 2,069 billion dollars
The final case appears to be a simple error on the part of one translator, as shown in
(38) through (40).
(38) Original: 8.879 millones de dólares
(39) Translation 1: 8.870 billion dollars
(40) Translation 2: 8.879 billion dollars
5 Solutions
In Helmreich and Farwell (1998), we argued strongly that in all such cases, a deep-
pragmatic solution is the “correct” one. That is, the correct solution examines the
meaning of the text as a whole, and then based on knowledge of and inference about
the intent of the source language author, the understanding of the source language
audience, the purpose of the translation activity, and knowledge of the target language
audience, determines an appropriate translation. We do not propose to re-argue that
case here. Suffice it to point out that scalar systems are clearly culture-specific, and that cultural boundaries may or may not coincide with linguistic ones.
Instead, we attempt to provide some interim solutions that might assist in getting the translation “right” without requiring an intensive knowledge-based translation system that relies on defeasible reasoning. Specifically, we provide an algorithm that relies on more readily accessible information and does not require a non-monotonic inference engine.
Step 1: Identify the scalar phrase. This step is fairly straightforward as these
phrases are nearly co-extensive with phrases that include a number.
Step 2: Identify the domain. This second step is not quite as easy. Frequently
measuring phrases include a unit of measurement, which can be tied to a property. If a
whole number is involved, counting may be involved, and the objects counted should
be mentioned in the phrase.
Step 3: Identify the scalar system. Once the domain has been identified, only a restricted set of scalar systems is associated with it. An additional clue may be found
in the source language, which may be indicative of certain scalar systems. The
linguistic representations of the counters are often unique, or may help narrow the
plausible set of scalar systems. For example, “January” would identify the Julian or
Gregorian calendar system, and eliminate the Jewish or Arabic ones.
It is at this step that some complexity may be added to the algorithm by hard-
coding inferences made on the basis of sets of clues. The clues can be simple, such as
the identity of the source language, or the geographical origin of the source text. Or it
could be more complex. For example, the object or objects possessing a measured
property often have more or less standard ranges for that property, ranges that are
dependent on the scale. For instance, the ambient temperature on earth usually ranges between –20 and 120 degrees Fahrenheit, and between roughly –30 and 55 degrees Celsius. A value between 55 and 120 that is predicated of the ambient temperature is therefore likely to be in degrees Fahrenheit.
Step 4: Select a scalar system for the target language text. This is the first step
in choosing a translation strategy. The easiest choice is the same system as the one used in the source language text. However, that may not be the best choice. Since the target
language and presumably the target language audience are known prior to translation,
an acceptable list of scalar systems can be chosen beforehand, for each domain.
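A minimal sketch of the four-step procedure is given below; the keyword tables, candidate lists, and default choices are illustrative assumptions rather than a specification, but they show how the steps chain together, including the ambient-temperature range heuristic mentioned under Step 3.

```python
import re

UNIT_CLUES = {
    "km": "length", "miles": "length", "feet": "length",
    "degrees": "temperature", "piso": "floor", "floor": "floor",
}
CANDIDATE_SYSTEMS = {
    "length": ["metric", "English"],
    "temperature": ["Celsius", "Fahrenheit", "Kelvin"],
    "floor": ["ground-floor-is-1", "ground-floor-is-0"],
}

def find_scalar_phrases(text):                        # Step 1: phrases containing a number
    return re.findall(r"[^.]*\b\d+(?:[.,]\d+)?\b[^.]*", text)

def identify_domain(phrase):                          # Step 2: look for a unit clue
    for clue, domain in UNIT_CLUES.items():
        if clue in phrase.lower():
            return domain
    return "count"                                    # fallback: plain counting

def identify_system(phrase, domain):                  # Step 3: narrow the candidates
    candidates = CANDIDATE_SYSTEMS.get(domain, ["natural numbers"])
    if domain == "temperature":
        value = float(re.search(r"-?\d+(?:[.,]\d+)?", phrase).group().replace(",", "."))
        if 55 < value <= 120:                         # range heuristic from the text above
            return "Fahrenheit"
    return candidates[0]

def select_target_system(domain, audience_prefs):     # Step 4: choose for the audience
    return audience_prefs.get(domain, "same as source")

phrase = find_scalar_phrases("It must be at least 80 degrees in the shade.")[0]
domain = identify_domain(phrase)
print(identify_system(phrase, domain))                           # Fahrenheit
print(select_target_system(domain, {"temperature": "Celsius"}))  # Celsius
```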
6 Conclusion
In this paper, we have first provided a framework that describes scalar systems and
with which it is possible to classify differences in scalar systems. Second, we have
shown that there are a number of such scalar systems, and that in many cases,
different scalar systems are in use in different languages and cultures. Third, we have
analyzed a number of texts and translations to ascertain the frequency of occurrence
of problematic scalar phrases. We discovered that scalar phrases occupy about 13% of
text and that 20% of that (about 2.5% of all text) is problematic with respect to
translation. Finally, though we believe that the best solution involves knowledge-
based inferencing, we offer an algorithm, which, though not fully satisfactory, can
help avoid some of the pitfalls of translating phrases involving scalar systems.
References
1. Farwell, D., and S. Helmreich. 1997. What floor is this? Beliefs and Translation.
Proceedings of the 5th International Colloquium on Cognitive Science, 95-102.
2. Farwell, D., and S. Helmreich. 1999. Pragmatics and Translation. Procesamiento de
Lenguaje Natural, 24:19-36.
3. Helmreich, S. and D. Farwell. 1998. Translation Differences and Pragmatics-Based MT.
Machine Translation 13:17-39.
4. White, J., T. O’Connell and F. O’Mara. 1994. “The ARPA MT Evaluation Methodologies:
Evolution, Lessons, and Future Approaches” in Technology Partnerships for Crossing the
Language Barrier: Proceedings of the First Conference of the Association for Machine
Translation in the Americas, Columbia, Maryland, pp. 193-205.
Feedback from the Field: The Challenge of Users in
Motion
1 Introduction
The military has developed and tested a variety of systems that include machine
translation (MT) and has sent some of these systems to the field. Although we strive
to get information about the utility of these systems, feedback is often difficult to
obtain and, when it comes, tricky to interpret. Here, we reflect on the process of
getting feedback from operators at the “forward edge” – usually mobile and under
stress; we look at what kinds of indicators or predictors of field utility are available
short of observation and direct feedback; and we discuss the adequacy of these
indicators and predictors with examples from the field. On this basis, we consider how
to organize user tests of MT to better predict its utility in the field and improve the
quality of feedback we get from users on the move. Note that we do not provide
references for some of what we talk about, either because a technology deployment is
militarily sensitive or because specific vendors are involved or because the data have
not yet been officially released.
When we refer to machine translation in the field, we mean end-to-end systems in
which MT is an “embedded” component and input undergoes numerous processes
that affect translation (Reeder & Loehr, 1998; Voss & Reeder, 1998; Voss & Van Ess-Dykema, 2000; 2004). Thus, feedback from the field must be interpreted in light of how other components perform. Here, we consider examples of MT embedded in speech translators as well as in document processing.
When machine translation goes with troops to the field, it is often deployed in distant,
hard-to-access locations with intermittent phone and email connection. These
locations may be in developing or less developed countries with limited or damaged
infrastructure. Moreover, users become busy with missions, working up to 20 hours a
day, seven days a week, leaving little time to send feedback to scientists and
developers. Finally, soldiers are highly and often unpredictably mobile, and they are
often reassigned to new tasks, making it hard to track those who might provide
instructive data. Clearly, being on the ground with soldiers at the time they are using a
system is the best way to get feedback. However, on dangerous missions and in
theaters where threat to life is imminent, scientific and evaluation teams are typically
barred.
Army operators, then, are a special case of users in motion: they are distributed,
dynamic, and difficult to locate. Getting their feedback is a challenge. However, as we
have pointed out previously, their feedback is precious in part because they are an
ideal customer for MT (Holland, Schlesiger, & Turner, 2000): Soldiers tolerate poorer
quality translations, and they regularly apply such translations to tasks lower on the
Taylor & White text-handling scale (1998), such as relevance screening. Even when
soldiers serve as linguists or translators who perform high-level tasks – say, writing
complete translations – they often lack the degree of training, experience, and full-
time assignment to those tasks that characterizes translators in intelligence agencies.
These distinctions may, in fact, account for why soldiers seem disproportionately
welcoming of automatic translation. Finally, soldiers who need technology will
generally find ways to fix and adapt systems on the fly. Given the value of what
soldiers have to teach us, how can we facilitate getting feedback from troops in the
field or, alternatively, mitigate the lack of it? What information about MT utility is
available, and by what means, short of being on-site or directly linked to users?
When we cannot be on site, or be linked to users by email, phone, or fax, there are
other sources of impressions about or predictions of the utility of MT in the field:
– Requests for particular systems from users in the field are often taken as an indicator of utility.
– Reports from military system managers who purchase, issue, and track deployed systems provide second-hand feedback.
– Results from formal and informal user tests before systems are fielded are used to predict utility in the field.
How valid and complete are these sources compared with on-site observations or
other direct feedback? Below we consider each form of information.
exercises that have operational realism. Such exercises are regularly scheduled both at
home and abroad to train and rehearse service members in skills they may need in
combat or peacekeeping – such as medical triage, logistics planning, or casualty
evacuation. Sometimes new technologies or prototypes can be incorporated into an
exercise to assess their suitability and to see where they need improvement.
Evaluation dimensions generally include effectiveness, suitability, and usability. Also,
outside of standard exercises, “limited utility evaluations” may be built around a
particular system that is of critical interest to a set of users.
The results of these assessments are sometimes used to support deployment
decisions. How adequately do the results predict whether a system is ready for field
operation – or, if a system is not deemed ready, how to prepare it?
The example of speech translation: What military exercises say. While flexible
two-way speech-to-speech translation is still a research goal (Lavie et al., 2001;
Schultz, 2002), more limited speech translation prototypes have been developed in “1
and ½ way” mode. This limited mode allows questions to be spoken in English and
limited responses to be spoken in another language. Responses are usually
constrained by the phrasing of questions – such as “Why did you come to this clinic?
Answer fever, vomiting, or pain” or “Tell me your age: just say a number.” In 2003-
04 a system in this mode was readied in a Central Asian language needed in the field.
A limited utility evaluation was performed at a domestic military post to ensure that
the system was ready to send with deploying troops. The results showed that the
system was usable, did not crash, usually recognized the English speakers, and often
recognized speakers of the other language. The latter speakers were language
instructors for whom the English input was masked, allowing them to play validly the
role of a non-English speaker.
This evaluation was reinforced by participation of the same 1 and ½ way system in
a joint yearly military exercise in 2004, where the system mediated between questions
in English and answers in a Pacific Rim language. While no measures of
communicative effectiveness were taken, an independent military test center surveyed
users on both the interviewer and the interviewee side about their use of and reaction
to the system and produced a report.
Thirty English-speaking troops who had acted as interviewers responded to 10
survey questions on translation effectiveness and system usability (for example, “It
was easy to conduct the queries for this scenario via the system” and “[system]
reduced the time needed to convey or acquire information.”) When responses were
aggregated, 95% (284/299) were positive. Twenty-eight speakers who had acted as interviewees (and did not speak English) responded to two survey questions on “ease of communication.” Summed across their responses, 93% were positive.
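For clarity, the arithmetic behind the reported aggregates is spelled out below; the interviewee split shown is hypothetical, chosen only to be consistent with the reported 93%.

```python
interviewers, questions = 30, 10
answers_given, positive = 299, 284            # one answer apparently missing from 300
print(interviewers * questions)               # 300 possible responses
print(round(100 * positive / answers_given))  # 95 -> the reported "95%"

interviewees, ease_questions = 28, 2          # 56 possible responses
hypothetical_positive = 52                    # a split consistent with the reported 93%
print(round(100 * hypothetical_positive / (interviewees * ease_questions)))  # 93
```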
Thus, the survey responses were overwhelmingly favorable toward the system. The
system was concluded to have “garnered the most favorable impressions from the
target population” [of 3 different systems tried in the exercise – the other 2 being for
text translation]. Conclusions about suitability and usability stated that the platform
“was very durable and easy to use...extremely portable and capable of extended
battery operation” and that the system “was stable and did not crash” while the
translations were “accurate... acceptable and understood by the non-English speakers
during the interviews.”
A telling limitation, however, showed up in the comments quoted from some users.
Recognizing these indicators, the report concluded that the scenarios – the question
and answer lists – that had been programmed into the system were “functionally
limited...[with] too few scenarios available, and the type of questions presented ...
did not reflect ... military operations.” Indeed, the report acknowledged that because
the scenarios did not match the requirements of the exercise, the user test was
conducted on the side rather than integrated into the exercise. But this was deemed
repairable: “Scenarios need to be expanded” by “coordinating with military
professional to create actual scenarios from start to finish in logical order.” It was
observed that “most users were successful in using the technology to conduct the
programmed scenarios.”
The example of speech translation: What the field says. What happened in the
field? An officer from the units deploying with the system reported to the deploying
agency that the systems weren’t being used; that when they were used, the foreign
speaker responses were sometimes not recognized at all and sometimes incorrectly
translated. Numbers were a noticeable problem. A respondent might say “35” when
asked their own age, or the age of a relative, and the response was translated as “23.”
Although no data were offered, it appeared that mistranslations happened often
enough to lead the troops to put aside the device and to call on human translators
when they needed to communicate with local people.
Given the limitations of second-hand feedback noted above, the home agency
decided to conduct another assessment to find out what was snagging communication
in the field. Maybe the recognizer was timing out too quickly, before respondents
completed their answer. Maybe respondents were intoning a dialect variation for
which the system had not been trained. Maybe ambient noise was causing
misrecognition. Maybe respondents were saying things that had not been included in
the response domain. The new assessment pointed to a combination of these factors
and suggested that the system had been prematurely fielded.
This suggestion is strengthened by a recent (2004) assessment of other selected speech translators, also in the 1 and ½ way mode, as part of a medical exercise in which use of the systems was integral to the larger triage and treatment goals
of the exercise. While users’ attitudes toward speech translators were favorable, the
systems did not work to mediate medic-patient communication. Results of this
assessment are being compiled by a military test agency.
Thus, the results of user tests in military exercises or in limited utility assessments
can mislead us about system utility and readiness. How can pre-deployment
evaluations be organized and designed to provide researchers and developers
diagnostic data that is fair and controlled, yet give decision-makers information that
better represents the realities of the field? The features of the above assessments
furnish pointers.
Feedback from missions overseas is sparse and precious. We might hypothesize that
some user tests tend to overestimate readiness. This would mean that MT may be
deployed before it is ready, making it likely that the feedback we get concerns all-or-
nothing function – it works or it fails – rather than specific features that can be
adjusted or enhanced or specific observations of efficiency and effectiveness and the
conditions that affect them. If pre-deployment evaluations can be organized and
designed to better predict utility in the field, or at least to rule out systems with major
shortcomings, then we can expect more meaningful feedback from users on the move.
At the same time, plunging systems into an exercise that simulates field conditions is
wasteful if a system has not already undergone tiers of testing that can pinpoint not
only component performance problems but also various levels of user difficulties so
that developers can address them.
Tests within the developer’s lab can isolate component performance problems and
permit these to be resolved before users are brought in. Subsequently, constrained
tests outside the developer’s lab with soldiers and with foreign language documents or
speakers can isolate levels of problems prior to integration of a system in a military
exercise. Two of the assessments of 1 and ½ way speech translation described above
– the limited utility evaluation at a domestic military post and the joint military
exercise in 2004 – exemplify constrained tests. What their design allows us to do is
determine whether under the best of circumstances – when interviewers and
interviewees say what they are supposed to say – the system can recognize and
translate the utterances. In both assessments, English speakers were trained on the
script (what they can say in English) and the foreign language speakers were well
prepared: although they were not given a script, they were allowed to practice with
the device, to repeat responses, and to learn as they went both acceptable response
length and how to conform answer content to the restrictions assumed by question
structure. These, then, were formative evaluations necessary to reveal basic functional
gaps in user interaction. If English speakers or speakers of the target language (albeit
carefully selected) are not able to learn how to use the device or cannot be recognized,
then the developer learns that there is a lot more work to do. But these evaluations
cannot substitute for integration into a field exercise that has a larger goal, that scripts
only the goals and not the words, and that views translation tools as one of many
means to an end.
Feedback from the field to date confirms the factor of surprise: The quality of
documents and the nature of their noise cannot be predicted – otherwise, OCR and
preprocessors could be trained to manage some of that noise. Nor can one predict the behavior and responses of indigenous people who are faced with a speech translation device and with questions posed in their language, even questions that are very well structured.
field.
Military exercises are known for being extensively planned, setting up authentic
scenarios, using real equipment, and finding experienced role players to interact with
soldiers. They also attempt to bring in data that mirrors what is found in the field –
although the noise surrounding that data is hard to duplicate. Exercises run off
predefined scenarios, with clear goals as in a real mission; but like a real mission, the
scenarios are never scripted down to the word.
When MT is well integrated into such an exercise, we have the best chance of encountering surprise and thereby predicting utility in the field. This kind of integration is
exemplified by the medical exercise referenced above, where role-players responding
to questions mediated by 1 and ½ way speech translation did not have the chance to
repeat and practice utterances until they were recognized. As patients in a disaster
relief scenario, their time with medics was too limited.
From our experience observing assessments of various kinds, from the lab to the
field, we make the following recommendations, especially with regard to speech MT:
– Follow lab tests of MT systems with limited utility evaluations with representative users and data.
– Employ increasing levels of realism in limited utility evaluations, for example,
Through these measures, we expect that systems going to the field will be more
thoroughly vetted and that feedback from the field, when we can obtain it, will reflect
aspects and surprises that can only be discovered there.
References
1. Holland, M., Schlesiger, C.: High-Mobility Machine Translation for a Battlefield
Environment. Proceedings of NATO/RTO Systems Concepts and Integration Symposium,
Monterey, CA. Hull, Canada: CCG, Inc. (ISBN 92-837-1006-1) (1998) 15/1-3
2. Lavie A., Levin L., Schultz T., Langley C., Han B., Tribble, A., Gates D., Wallace D.,
Peterson K.: Domain Portability in Speech-to-speech Translation. Proceedings of the First International Conference on Human Language Technology (HLT 2001), San Diego, March 2001.
3. Reeder, F., Loehr, D.: Finding the Right Words: An Analysis of Not-Translated Words in
Machine Translation. In: Farwell, D. et al. (eds.), Machine Translation and the Information
Soup: Proceedings of the Association for Machine Translation in the Americas Annual
Meeting. Springer-Verlag (1998) 356-363
4. Schultz, T.: GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University. ICSLP 2002, Denver, Colorado (2002)
5. Taylor, K., White, J.: Predicting What MT is Good for: User Judgments and Task
Performance. In: Farwell, D. et al. (eds.), Machine Translation and the Information Soup:
Proceedings of the Association for Machine Translation in the Americas Annual Meeting.
Springer-Verlag (1998) 364-373
6. Voss, C., Reeder, F. (eds.): Proceedings of the Workshop on Embedded Machine
Translation: Design, Construction, and Evaluation of Systems with an MT Component. (In
conjunction with the Association for Machine Translation in the Americas Annual Meeting,
Langhorne, PA). Adelphi, MD: Army Research Lab. (1998)
7. Voss, C., Van Ess-Dykema, C.: When is an Embedded MT System “Good Enough” for
Filtering? Proceedings of the Embedded Machine Translation Workshop II. In conjunction
with the Applied Natural Language Processing Conference, Seattle (2000)
The Georgetown-IBM Experiment
Demonstrated in January 1954
W. John Hutchins
jhutchins@beeb.net;
http://ourworld.compuserve.com/homepages/WJHutchins
1 The Impact
On the 8th January 1954, the front page of the New York Times carried a report of a
demonstration the previous day at the headquarters of International Business
Machines (IBM) in New York under the headline “Russian is turned into English by a fast electronic translator”:
A public demonstration of what is believed to be the first successful use of a
machine to translate meaningful texts from one language to another took
place here yesterday afternoon. This may be the cumulation of centuries of
search by scholars for “a mechanical translator.”
Similar reports appeared the same day in many other American newspapers (New
York Herald Tribune, Christian Science Monitor, Washington Herald Tribune, Los
Angeles Times) and in the following months in popular magazines (Newsweek, Time,
Science, Science News Letter, Discovery, Chemical Week, Chemical Engineering
News, Electrical Engineering, Mechanical World, Computers and Automation, etc.) It
was probably the most widespread and influential publicity that MT has ever
received. The experiment was a joint effort by two staff members of IBM, Cuthbert
Hurd and Peter Sheridan, and two members of the Institute of Languages and
Linguistics at Georgetown University, Leon Dostert and Paul Garvin.
2 The Background
Léon Dostert had been invited to the first conference on machine translation two years before, in June 1952, on account of his experience with mechanical aids for translation. Dostert had been Eisenhower’s personal interpreter during the war, had
been liaison officer to De Gaulle, and had worked for the Office of Strategic Services
(predecessor of the Central Intelligence Agency). After the war he designed and
installed the system of simultaneous interpretation used during the Nuremberg war
crimes tribunal, and afterwards at the United Nations. In 1949 he was invited to Georgetown University, where he established the Institute of Languages and Linguistics.
3 The Demonstration
Reports of the demonstration appeared under headlines such as “Electronic brain translates Russian”, “The bilingual machine”, “Robot brain translates Russian into King’s English”, and “Polyglot brainchild” – at the time computers were commonly
referred to as ‘electronic brains’ and ‘giant brains’ (because of their huge bulk).
The newspapermen were much impressed:
In the demonstration, a girl operator typed out on a keyboard the following
Russian text in English characters: “Mi pyeryedayem mislyi posryedstvom ryechi”. The machine printed a translation almost simultaneously: “We transmit thoughts by means of speech.” The operator did not know Russian. Again she types out the meaningless (to her) Russian words: “Vyelyichyina ugla opryedyelyayatsya otnoshyenyiyem dlyini dugi k radyiusu.” And the machine translated it as: “Magnitude of angle is determined by the relation of length of
arc to radius.“ (New York Times)
It appears that the demonstration began with the organic chemistry sentences.
Some of these were reported, e.g.
The quality of coal is determined by calory content
Starch is produced by mechanical method from potatoes.
but the journalists were clearly much more impressed by those on other topics:
And then just to give the electronics a real workout, brief statements about
politics, law, mathematics, chemistry, metallurgy, communications, and
military affairs were submitted in the Soviet language... (Christian Science
Monitor)
All the reports recognised the small scale of the experiment, but they also reported predictions of future developments from Dostert.
4 The Processes
Most of the reports are illustrated with a photograph of a punched card with a Russian
sentence; and many have photographs of the machines and of the Georgetown and
IBM personnel. But they gave few hints of how the system worked.
The most common references were to rules for inversion, all using the example of
Russian gyeneral mayor, which has to come out in English as major general. One
gave some idea of the computer program:
The switch is assured in advance by attaching the rule sign 21 to the Russian
gyeneral in the bilingual glossary which is stored in the machine, and by
attaching the rule-sign 110 to the Russian mayor. The stored instructions, along
with the glossary, say “whenever you read a rule sign 110 in the glossary, go back and look for a rule-sign 21. If you find 21, print the two words that follow it in reverse order”. (Journal of Franklin Institute, March 1954)
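The quoted mechanism can be paraphrased in a few lines; the following is a loose modern reconstruction using only the two glossary entries of the example, not the actual 1954 program.

```python
GLOSSARY = {
    "gyeneral": ("general", 21),   # rule sign 21: candidate for inversion
    "mayor":    ("major",  110),   # rule sign 110: look back for a 21-word
}

def translate(words):
    output = []                    # list of (english, rule_sign) pairs
    for word in words:
        english, rule = GLOSSARY[word]
        if rule == 110 and output and output[-1][1] == 21:
            # "print the two words ... in reverse order"
            output.insert(-1, (english, rule))
        else:
            output.append((english, rule))
    return " ".join(english for english, _ in output)

print(translate(["gyeneral", "mayor"]))   # major general
```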
A few explained how rules selected between alternative translations:
The word root “ugl” in Russian means either “angle” or “coal” depending upon
its suffix. This root is stored in the form of electrical impulses on a magnetic
drum together with its English meanings and the Garvin rules of syntax and
context which determine its meaning. The code is so set up so that when the
machine gets electrical impulses via the punched cards that read "ugla" it translates it as "angle", when "uglya" the translation is "coal". (New York
Herald Tribune)
It is doubtful whether newspaper readers would have gained much understanding
from these brief explanations. However, some of the weeklies went into much more
detail. Macdonald’s report [4] included a list of the six rules, a flowchart of the
program for dictionary lookup and a table illustrating the operation of the rules on a
sample sentence.
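The inversion mechanism as reported in these accounts is simple enough to sketch in a few lines of present-day code. The following is an illustrative reconstruction only: the rule signs 21 and 110 and the gyeneral mayor example come from the quoted description, while the data structures and function names are assumptions.

```python
# Illustrative reconstruction only: the rule signs 21 and 110 and the example
# come from the quoted description; the data structures are assumptions.
GLOSSARY = {
    "gyeneral": ("general", 21),
    "mayor": ("major", 110),
}

def translate(russian_words):
    output, previous_sign = [], None
    for word in russian_words:
        english, rule_sign = GLOSSARY[word]
        if rule_sign == 110 and previous_sign == 21:
            # Rule sign 110: the preceding word carried 21, so print the
            # pair in reverse order (gyeneral mayor -> major general).
            output.insert(len(output) - 1, english)
        else:
            output.append(english)
        previous_sign = rule_sign
    return " ".join(output)

print(translate(["gyeneral", "mayor"]))  # -> "major general"
```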
5 The Computer
An illuminating account of the computational aspects of the experiment is given in the
contemporary article by Peter Sheridan [9]. As the first substantial attempt at non-
numerical programming, every aspect of the process had involved entering quite
unknown territory. Decisions had to be made on how alphabetic characters were to be
coded, how the Russian letters were to be transliterated, how the Russian vocabulary
was to be stored on the magnetic drum, how the ‘syntactic’ codes were to operate and
how they were to be stored, how much information was to go on each punched card,
etc. Detailed flow charts were drawn up for what today would be simple and
straightforward operations, such as the identification of words and their matching
against dictionary entries.
The IBM 701-type machine had been developed for military applications and was
first installed in April 1953. Like other computers of the day its main tasks were the
solution of problems in nuclear physics, rocket trajectories, weather forecasting, etc. It
was hired out initially at $15,000 per month, and later sold at $500,000 – and was at
that time only one of about 100 general-purpose computers in existence. Its huge size
was impressive; it was likened to "an assortment of 11 complicated electronic units, not unlike modern kitchen ranges, connected by cables to function as a unit" and "which occupy roughly the same area as a tennis court." [7]. A similar-sized machine, the 702, was also developed for business applications. Its successor in late 1955 was the 704 model, a substantial improvement on the 701, which sold in large numbers.
The 701 could perform 33 distinct operations: addition, subtraction, multiplication, division, shifting, transfers, etc. – all coded in 'assembly language'. Multiplications were performed at a rate of 2,000 per second. The machine had two types of storage. Electrostatic (high-speed) storage was in the form of a bank of cathode ray tubes; each unit could accommodate up to 2048 "full words", where a "full word" comprised 35 bits (binary digits) and one sign bit – 36 bits in all. Each full word could be split (stored) as two "half words", each of 17 bits and one sign bit. Although the
701 had two electrostatic units, only one was used in the MT experiment. Average
access time was 12 microseconds. The second type of storage (with lower access
speed, 40 milliseconds) was a magnetic drum unit comprising four ‘addressable’
drums, each accommodating up to 2048 ‘full words’. The magnetic drum was used to
store dictionary information; the reading and writing rate was 800 words per second.
Input to the 701 was by card reader. Information from 72 column cards (precursor
of the familiar IBM 80 column punched cards in use for computer input until the early
1970s) – i.e. each with a maximum capacity of 72 upper case (capital letter)
alphabetic or numeric characters – could be read and converted to internal binary code
at a rate of 150 cards per minute. Output was by a line printer (also capital letters only) at a
rate of 150 lines per minute.
The program used a seven-bit code for characters: six bits for distinguishing 40
alphanumeric and other characters, plus one sign bit used for various tests (see
below). This means that each "full word" location could contain up to five characters (5 × 7 = 35 bits, the remaining bit of the 36-bit word being its sign bit).
The Russian-English dictionary was input by punched cards and stored on the
(low-speed) magnetic drum. The Russian word and the English equivalents (two
maximum) were stored on consecutive locations, separated by ‘full words’ containing
zeros. They were followed by the so-called diacritics on consecutive drum locations.
Each ‘word’ included a ‘sign bit’, either + or -, which indicated whether the entry was
for a stem or for an ending, respectively.
Sentences were read into the electrostatic storage, separated by strings of zero-
filled ‘words’. The input words were then each looked up in the drum storage, first by
consultation of a "thumb index" which gave the address (location) of the first word in
the dictionary with the same initial letter. The lookup routine searched for the longest
matching string of characters (whether complete word or stem plus hyphen), extracted
the (two) English equivalents onto a separate area of the store, and copied the
diacritics onto another area of the store. A special area was also set aside for the
temporary (erasable) location of word-endings. Each of these areas and addresses
would have to be specified either directly (specifically by store address) or indirectly
(using variables) in the program (called Lexical Syntax Subprogram). Sheridan
describes the operations of comparison in terms of successive and repeated processes
of logical multiplication, addition and subtraction using ‘masks’ (sequences of binary
digits). When a 'diacritic' indicated that either the first or the second English equivalent was to be selected, the program went back to the addresses in the separate store area and transferred the selected equivalent to a (temporary) print-out area of the electrostatic store.
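To make the lookup procedure concrete, here is a toy sketch in modern code. It is an illustration only: the stem/ending split and the diacritic codes follow the description above and the examples given later in the paper, but the data structures and the particular entries shown are assumptions, not the actual 701 storage layout.

```python
# Toy sketch of the dictionary lookup described above: input words are matched
# against complete words, or against the longest stem plus an ending, and the
# stored English equivalents and diacritic codes are copied to working areas.
# The entries are illustrative; the drum/electrostatic storage is not modelled.
DICTIONARY = {
    # form: (english_equivalents, diacritic_codes); stems end in '-' and
    # endings begin with '-'
    "ugl-": (("angle", "coal"), ("121",)),
    "-a": (("of",), ("221",)),
    "-ya": (("of",), ("222",)),
}

def lookup(word):
    if word in DICTIONARY:                      # complete-word match
        return [(word,) + DICTIONARY[word]]
    for split in range(len(word) - 1, 0, -1):   # longest stem first
        stem, ending = word[:split] + "-", "-" + word[split:]
        if stem in DICTIONARY and ending in DICTIONARY:
            return [(stem,) + DICTIONARY[stem], (ending,) + DICTIONARY[ending]]
    return []                                   # no match found

print(lookup("ugla"))   # -> entries for stem 'ugl-' and ending '-a'
print(lookup("uglya"))  # -> entries for stem 'ugl-' and ending '-ya'
```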
Rule 2. Choice-Following text. If first code is ‘121’, is second code of following complete word or either portion (root or ending) of following subdivided word equal to ‘221’ or ‘222’? If it is ‘221’, adopt English equivalent I of word carrying ‘121’; if it is ‘222’, adopt English equivalent II. In both cases, retain order of appearance of output words.
Rule 3. Choice-Rearrangement. If first code is ‘131’, is third code of
preceding complete word or either portion (root or ending) of preceding
subdivided word equal to ‘23’? If so, adopt English equivalent II of word
carrying ‘131’, and retain order of appearance of words in output – if not,
adopt English equivalent I and reverse order of appearance of words in
output.
Rule 4. Choice-Previous text. If first code is ‘141’, is second code of
preceding complete word or either portion (root or ending) of preceding
subdivided word equal to ‘241’ or ‘242’? If it is ‘241’, adopt English
equivalent I of word carrying ‘141’; if it is ‘242’ adopt English equivalent
II. In both cases, retain order of appearance of words in output.
Rule 5. Choice-Omission. If first code is ‘151’, is third code of following
complete word or either portion (root or ending) of following subdivided
word equal to ‘25’? If so, adopt English equivalent II of word carrying
‘151’; if not, adopt English equivalent I. In both cases, retain order of
appearance of words in output.
Rule 6. Subdivision. If first code associated with a Russian dictionary word
is then adopt English equivalent I of alternative English language
equivalents, retaining order of appearance of output with respect to previous
word.
According to Sheridan, the rules formulated in this manner were easily converted
into program code for the 701 computer. Sheridan’s account [9] includes a flowchart
of the processes of dictionary lookup and ‘operational syntax’.
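As an illustration of how rules of this kind translate into program code, the following sketch applies two of the choice rules to entries carrying diacritic codes. The code values ('121', '221', '222', '141', '241', '242') are taken from the rules quoted above; the function and field names and the example entries are assumptions made for illustration.

```python
# Illustrative sketch of how choice rules of the kind quoted above might be
# expressed in code. The code values come from the rules; the entry layout,
# names, and example data are assumptions.
def apply_choice_rules(entries):
    """entries: list of dicts with 'eng' (one or two English equivalents)
    and 'codes' (the diacritic codes attached to the entry)."""
    output = []
    for i, entry in enumerate(entries):
        equivalents = entry["eng"]
        eng1 = equivalents[0]
        eng2 = equivalents[1] if len(equivalents) > 1 else eng1
        first = entry["codes"][0] if entry["codes"] else None
        chosen = eng1
        if first == "121" and i + 1 < len(entries):
            # Rule 2: look at the codes of the following entry.
            if "222" in entries[i + 1]["codes"]:
                chosen = eng2
        elif first == "141" and i > 0:
            # Rule 4: look at the codes of the preceding entry.
            if "242" in entries[i - 1]["codes"]:
                chosen = eng2
        if chosen:
            output.append(chosen)
    return " ".join(output)

# 'ugl-' carries '121' and the ending '-a' carries '221', so equivalent I
# ("angle") is chosen; with the ending '-ya' ('222') it would be "coal".
print(apply_choice_rules([
    {"eng": ("angle", "coal"), "codes": ("121",)},
    {"eng": ("",), "codes": ("221",)},   # the ending itself produces no output
]))
```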
7 The Sentences
The most detailed account of the linguistic operations is given by Garvin [3] in a
retrospective evaluation of the experiment and its significance. He includes 137 dictionary entries for words, stems and endings, and 49 of the original Russian sentences (in a transliteration scheme devised for the experiment).
Most of the "more than 60" sentences in the demonstration concerned topics of
organic chemistry. These were intended to illustrate the variety of sentence patterns
which the rules could deal with, and nouns and verbs occurring in different roles.
Some examples are:
(1) (a) They prepare TNT; (b) They prepare TNT out of coal; (c) TNT is
prepared out of coal; (d) TNT is prepared out of stony coal; (e) They
prepare ammonite; (f) They prepare ammonite out of saltpeter; (g)
Ammonite is prepared out of saltpeter.
(2) (a) They obtain gasoline out of crude oil; (b) Gasoline is obtained out of
crude oil; (c) They obtain dynamite from nitroglycerine; (d) Ammonite is
obtained from saltpeter; (e) Iron is obtained out of ore; (f) They obtain
iron out of ore; (g) Copper is obtained out of ore.
(3) (a) They produce alcohol out of potatoes; (b) Alcohol is produced out of potatoes; (c) They produce starch out of potatoes; (d) Starch is produced out of potatoes.
The example above also illustrates the method of distinguishing homonyms which the newspapers reported, namely the selection of angle or coal for ‘ugl-’. Rule 2 (initiated by PID ‘121’) searches for ‘221’ or ‘222’ in the following entry. When it is ‘-a’, ‘221’ is found and the result is the choice of angle; when it is ‘-ya’, ‘222’ is found and the choice is coal. In fact, the original Russian is not strictly a homonym: there are two separate words, ugol (‘corner’ or ‘angle’) and ugol’ (‘coal’). Garvin’s procedure is based on the fact that the genitive of ugol is ugla and the genitive of ugol’ is uglya.
The example also illustrates Garvin’s approach to the ambiguity (or rather
multiple possible translations) of prepositions. Garvin reduces the problem by
providing just two equivalents, which are selected by occurrences of ‘221’ or ‘222’ in
the ending of the following noun – which can certainly be justified linguistically.
A number of sentences contain instrumental phrases (by chemical process, by
mechanical method). Each is generated by producing the preposition by from the case ending of the adjective and then translating the following noun as if it had no
case ending. This would be regarded then and now as abnormal, since adjectival
forms are held to be dependent on the nouns they are modifying, so we would expect
by to be generated from the noun ending. For example, the phrase by optical
measurement:
The entry ‘optichesk-’, with its PID, is printed out (optical); the suffix ‘-yim’ initiates rule ‘131’, fails to find ‘23’, outputs (by) and inverts the word order (i.e. by optical). The next entry ‘yizmyeryenyi-’ has one equivalent (measurement); its suffix ‘-yim’ invokes rule ‘131’, finds a ‘23’ in the preceding subdivided word (the entry for ‘-yim’), selects Eng2 (‘---’, i.e. nothing) and retains word order.
From these descriptions it is apparent that much of the variety of sentences is
derived from a fairly restricted set of interlocking rules and codes operating on fixed
patterns into which nouns and phrase constructions are slotted.
Many of the operations are quite clearly specific to the particular words and
sentences in the sample, and the rules are applied as appropriate in specific instances
– this is particularly obvious in the non-chemistry sentences. There was no attempt to
align rules with specific linguistic features. In particular, there was no analysis in
terms of grammatical categories (noun, verb, adjective) and no representation of
either agreement relations, or dependency relations, or phrase/clause structures.
9 The Consequences
The Russians had seen reports of the January event and the time was propitious (after
the ‘thaw’ following Stalin’s death) for the Russians to engage in the development of
computers and their applications. The Institute of Precise Mechanics and Computer
Technology had just completed development of BESM, the first working Russian
computer, and machine translation was to be among its first applications, under the
direction of Yurij Panov. By early 1956 it was ready to demonstrate a prototype
system, which in many respects followed the design of the IBM-Georgetown system,
with a basic set of rules for substitution, movement and morphological splitting [8].
At Georgetown itself, however, despite the widespread publicity there was no
official support for further research until a grant in June 1956 from the National
Science Foundation, stimulated, as it seemed at the time, by the Soviet interest [5].
The funds were in fact from the Central Intelligence Agency – Dostert had worked for
its predecessor (Office for Strategic Services) and was a good friend of its director
Allen Dulles. A full-scale project for Russian-English translation was organized with
more than twenty researchers [5]. Initially two groups were set up: one for developing
a dictionary, the other for linguistic analysis. After examining the coding of the 1954
experiment for a few months, the group decided to abandon continuation on these
lines – a fact often forgotten by later critics of the Georgetown activity. There
emerged a considerable divergence of opinions, and Dostert decided to give each of
the proposed methods a chance to show its capability in ‘free competition’. By
January 1957 there were four groups, known as ‘code-matching’, ‘syntactic analysis’,
‘general analysis’, and ‘sentence-by-sentence’. The first group, headed by Ariadne
Lukjanow, assigned codes to dictionary entries which indicated grammatical and
association functions and which were compared and matched during analysis. The
second group under Garvin developed a method of dependency syntactic analysis
later known as ‘fulcrum method’. The third group under Michael Zarechnak
formulated a method of sentence analysis at various levels (morphological,
syntagmatic, syntactic), i.e. a variant of ‘phrase structure’ analysis. The fourth ‘group’
was a one-man project of French-English translation by A.F.R. Brown, in which procedures developed for one sentence were tested on another, more procedures were added and tested on a further sentence, additional procedures were added and tested, and so
forth. In due course, Lukjanow and Garvin left the Georgetown project to continue
elsewhere, and the ‘general analysis’ method was adopted together with Brown’s
computational techniques [5], [10], [11].
10 The Assessments
A year after the demonstration, Dostert [2] gave an assessment of the significance of
the experiment, and suggested future ideas for MT development. In his opinion, the
experiment (a) "has given practical results by doing spontaneous, authentic, and clear translation", (b) showed that "the necessity of pre- and post-editing has not been verified", (c) demonstrated that "the primary problem in mechanical translation... is a problem of linguistic analysis...", (d) formulated "the basis for broader systematic lexical coding", defining "four specific areas of meaning determination... from which fruitful results may be expected", (e) developed a "functional coding system, permitting the preparation of functional, subfunctional and technical lexicons... reducing the magnitude of the coding requirements and thereby... the extent of storage needs", and (f) provided a "theory for the development of a general code for the mechanical formulation of multilingual syntax operations". These were major
claims, and clearly not justified on the basis of this first small-scale experiment.
Rather these were expectations Dostert had for later research.
The retrospective assessment by Garvin [3] was much more modest than
Dostert’s. In this somewhat defensive account of his early essay in MT research,
Garvin freely admitted the shortcomings of the experiment. The limitations were the
consequence of restricting the algorithm to "a few severely limited rules, each containing a simple recognition routine with one or two simple commands." Nevertheless, in Garvin's view, the experiment was "realistic because the rules dealt with genuine decision problems, based on the identification of the two fundamental types of translation decisions: selection decisions and arrangement decisions."
Garvin summarised the principal limitations: the restriction of the search span to immediately adjacent items, the restriction of target words to just two possibilities, and the restriction of rearrangements to two immediately adjacent items. The choice of target-language equivalents was restricted to those which were idiomatic for the selected sentences only. The limitation of the procedure for Russian case endings was severe: either a case suffix was not translated at all or it was translated by one "arbitrarily assigned" English preposition. Further limitations were
highlighted by Michael Zarechnak [11], a member of the Georgetown group. None of
the Russian sentences had negative particles; all were declaratives; there were no
interrogatives or compound sentences (coordinate or subordinate clauses); and nearly
all the verbs were in the third person.
Does this mean that the experiment was fixed, a deception? Naturally members of
the Georgetown group deny it – pointing out that the program "was thoughtfully specified and implemented; the program ran, the translation was generated according to the program, which was developed based on... linguistic principles." [6]. This was
basically true, however, only for the chemistry sentences and the rules and dictionary
entries which were applied for their translation. Further chemistry sentences could
clearly have been treated, with an expansion of the dictionary – but only as long as the
sentences conformed to the patterns of those in the sample. There are many chemistry
sentences that would obviously not be covered by the rules. Although organic
chemistry might constitute a sublanguage and its vocabulary might be captured in a
‘micro-glossary’ (as others advocated at the time) with few ambiguities, this program
in 1954 did not cover all of the field, nor indeed a substantial proportion of it. As for
the non-chemistry sentences, these were clearly produced by dictionary entries and
codes specifically designed for this particular demonstration; and there could have
been no question of expanding general coverage on the lines of this program – as
indeed was found in the later research at Georgetown.
The limitations of the experiment made it possible for the output to be
impressively idiomatic, and would have suggested to many observers (not only
reporters) that continued experiments on the same lines would lead to systems with
larger vocabularies and even better quality. On the other hand, the experiment drew
attention to the importance of the linguistic problems, and in particular that translation
was not a trivial task even for the largest computer, as a contemporary commentator,
Ornstein [7], remarked:
the formulation of the logic required to convert word meanings properly, even in
a small segment of two languages, necessitated as many instructions to the
computer as are required to simulate the flight of a guided missile.
In truth, neither Dostert nor Garvin claimed much more than that it was a first
effort (a "Kitty Hawk" experiment) – not even a prototype system. In later years, they
might well have agreed that the demonstration had been premature; certainly it was
made public at a stage much earlier than other contemporary MT researchers would
have contemplated. However, there was another probably more important aim for
Dostert; it was to attract funds for further research at Georgetown, and it succeeded.
11 The Implications
Undoubtedly, the wrong impression had been given that automatic translation of good
quality was much closer than was in fact the case. Sponsorship and funding of MT
research in the following years were more liberal (and unquestioning) than they ought
to have been. The results in the next 10 years were inevitably disappointing, and as a
consequence, the funders set up an investigation committee, ALPAC [1]. One of the principal arguments used by ALPAC was that MT output had to be extensively post-edited. They pointed out that the Georgetown-IBM output was of a quality that had no need of editing while that of later Georgetown systems did. The mistake of ALPAC
was to ignore the preliminary nature of the 1954 experiment, that it had been
specifically designed for a small sample of sentences, that it had not been a
‘prototype’ system but a ‘showcase’ intended to attract attention and funds, and that
comparisons with full-scale MT systems were invalid.
In 1954, when other MT groups saw reports of Dostert’s demonstration they were
disparaging or dismissive. They disliked three things. One was the conduct of
research through newspapers; another was the exaggerated publicity given by
journalists to an obviously incomplete system; and a third was the passing-off as true 'translations' of sentences which could only have been extracted as wholes from
computer memories. Other MT groups were far from even thinking of demonstrating
their results – and were unprepared to do so for many years to come.
It was only the first of several demonstrations by the Georgetown group. Later ones were
undoubtedly more genuine – systems had not been ‘doctored’ – but the suspicion of
other MT groups was that they were not all they appeared. Such suspicions continued
to haunt the Georgetown group throughout its existence and have coloured the
judgements of later commentators.
It does have to be admitted, however, that the Georgetown-IBM demonstration
has not been the only example of an MT system being ‘doctored’ for a particular
occasion. In subsequent years it was not uncommon for demonstrated systems to
introduce grammar and vocabulary rules specifically to deal with the sentences of a
particular text sample, with the aim of showing their system in the best possible light.
In recent years MT researchers have been much more circumspect when
demonstrating experimental systems and have been less willing to indulge in
speculations for journalists. The painful lessons of the Georgetown-IBM
demonstration seem to have been learned by MT researchers. On the other hand, some
vendors of systems have a more ‘liberal’ attitude: many MT systems are being
publicised and sold (particularly on the internet) with equally exaggerated claims and
perhaps with equally damaging impact for the future of machine translation.
The historical significance of the Georgetown-IBM demonstration remains that it
was an actual implementation of machine translation on an operational computer.
Before 1954, all previous work on MT had been theoretical. Considering the state of
the art of electronic computation at the time, it is remarkable that anything resembling
automatic translation was achieved at all. Despite all its limitations, the demonstration
References
1. ALPAC: Languages and machines: computers in translation and linguistics. A report by
the Automatic Language Processing Advisory Committee, Division of Behavioral
Sciences, National Academy of Sciences, National Research Council. Washington, D.C.:
National Academy of Sciences – National Research Council. (1966)
2. Dostert, Leon E.: ‘The Georgetown-I.B.M. experiment’, in Locke, W.N. and Booth, A.D. (eds.) Machine translation of languages. Cambridge, Mass.: M.I.T. Press (1955), 124-135.
3. Garvin, Paul: ‘The Georgetown-IBM experiment of 1954: an evaluation in retrospect’,
Papers in linguistics in honor of Dostert. The Hague: Mouton (1967), 46-56; reprinted in
his On machine translation. The Hague: Mouton (1972), 51-64.
4. Macdonald, Neil: ‘Language translation by machine - a report of the first successful trial’,
Computers and Automation 3 (2), 1954, 6-10.
5. Macdonald, R.R.: ‘The history of the project’ in: General report 1952-1963, ed.
R.R. Macdonald. Washington, DC: Georgetown University Machine Translation Project
(1963), 3-17. (Georgetown University Occasional Papers on Machine Translation, 30)
6. Montgomery, Christine A.: ‘Is FAHQ(M)T impossible? Memoirs of Paul Garvin and other
MT colleagues’, in: W. John Hutchins (ed.) Early years in machine translation: memoirs
and biographies of pioneers. Amsterdam/Philadelphia: John Benjamins (2000), 97-110.
7. Ornstein, Jacob: ‘Mechanical translation: new challenge to communication’, Science 122 (21 October 1955), 745-748.
8. Panov, Yu.N.: Automatic translation. Translated by R. Kisch. London: Pergamon (1960).
9. Sheridan, Peter: ‘Research in language translation on the IBM type 701’, IBM Technical
Newsletter 9, 1955, 5-24.
10. Vasconcellos, Muriel: ‘The Georgetown project and Leon Dostert: recollections of a young
assistant’, in: W. John Hutchins (ed.) Early years in machine translation: memoirs and
biographies of pioneers. Amsterdam/Philadelphia: John Benjamins (2000), 87-96.
11. Zarechnak, Michael: ‘The history of machine translation’, in: Bozena Henisz-Dostert, R.
Ross Macdonald and Michael Zarechnak, Machine translation. The Hague: Mouton
(1979), 20-28.
Pharaoh:
A Beam Search Decoder for Phrase-Based
Statistical Machine Translation Models
Philipp Koehn
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
koehn@csail.mit.edu
2 Decoder
We will now describe the search algorithm of Pharaoh. More detailed information
is given by Koehn [2003]. The decoder implements a beam search and is roughly
similar to work by Tillmann [2001] and Och [2002]. In fact, by reframing Och’s
alignment template model as a phrase translation model, the decoder is also
suitable for his model (without the use of word classes).
We start by defining the concept of translation options and describe the basic mechanism of beam search and its necessary components. We continue with details on word lattice generation and XML markup as an interface for external components.
Translation Options — Given an input string of words, a number of phrase
translations could be applied. We call each such applicable phrase translation
a translation option. This is illustrated in Figure 2, where a number of phrase
Fig. 2. Some translation options for the Spanish input sentence Maria no daba una
bofetada a la bruja verde
Fig. 3. State expansion in the beam decoder: in each expansion English words are generated, additional foreign words are covered (marked by an asterisk), and the probability cost so far is adjusted. In this example the input sentence is Maria no daba una bofetada a la bruja verde.
translations for the Spanish input sentence Maria no daba una bofetada a la
bruja verde are given.
Translation options are collected before any decoding takes place. This allows
a quicker lookup than consulting the phrase translation table during decoding.
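A minimal sketch of this pre-collection step follows, under the assumption that the phrase table is available as a simple mapping from foreign phrases to scored English phrases; the names and the toy table are illustrative, not the decoder's actual data structures.

```python
# Illustrative sketch of collecting translation options before decoding,
# assuming the phrase table is a dict from foreign phrases to lists of
# (english, probability) pairs. Names and the toy table are not Pharaoh's
# actual data structures.
def collect_options(foreign_words, phrase_table, max_phrase_length=7):
    options = []   # (start, end, english, probability), end exclusive
    for start in range(len(foreign_words)):
        for end in range(start + 1, min(len(foreign_words), start + max_phrase_length) + 1):
            phrase = " ".join(foreign_words[start:end])
            for english, prob in phrase_table.get(phrase, []):
                options.append((start, end, english, prob))
    return options

toy_table = {
    "Maria": [("Mary", 0.9)],
    "no": [("did not", 0.4), ("no", 0.3)],
    "bruja verde": [("green witch", 0.8)],
}
sentence = "Maria no daba una bofetada a la bruja verde".split()
print(collect_options(sentence, toy_table))
```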
Core Algorithm — The phrase-based decoder we developed employs a
beam search algorithm, similar to the one by Jelinek [1998] for speech recogni-
tion. The English output sentence is generated left to right in form of hypotheses.
This process is illustrated in Figure 3. Starting from the initial hypothesis,
the first expansion is the foreign word Maria, which is translated as Mary. The
foreign word is marked as translated (marked by an asterisk). We may also
expand the initial hypothesis by translating the foreign word bruja as witch.
We can generate new hypotheses from these expanded hypotheses. Given
the first expanded hypothesis we generate a new hypothesis by translating no
with did not. Now the first two foreign words Maria and no are marked as
being covered. Following the back pointers of the hypotheses we can read off the (partial) translations of the sentence.
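The hypothesis bookkeeping described here can be sketched as follows. This is an illustration, not the decoder's implementation: the fields mirror the properties used later for recombination (coverage, the last two English words, the end of the last covered foreign phrase), and cost_of stands in for the model scoring, which is not reproduced here.

```python
# Illustrative sketch of the hypothesis bookkeeping, not the decoder's actual
# implementation. cost_of stands in for the translation/language/distortion
# model scoring.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Hypothesis:
    coverage: int                   # bitmask over foreign word positions
    last_two: Tuple[str, ...]       # last two English words generated
    last_end: int                   # end of the last foreign phrase covered
    cost: float                     # accumulated (negative log) cost
    back: Optional["Hypothesis"] = None
    phrase: str = ""                # English phrase added by this expansion

def expand(hyp, option, cost_of):
    start, end, english, _ = option
    span = ((1 << (end - start)) - 1) << start
    if hyp.coverage & span:
        return None                 # would re-translate covered words
    last_two = (hyp.last_two + tuple(english.split()))[-2:]
    return Hypothesis(hyp.coverage | span, last_two, end,
                      hyp.cost + cost_of(hyp, option), hyp, english)

def read_off(hyp):
    # Follow the back pointers to read off the (partial) translation.
    phrases = []
    while hyp is not None and hyp.phrase:
        phrases.append(hyp.phrase)
        hyp = hyp.back
    return " ".join(reversed(phrases))
```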
Given this algorithm, the size of the search space of possible translations is exponential in the length of the sentence. To overcome this, we prune out hypotheses using hypothesis recombination and heuristic pruning.
Recombining Hypotheses — Recombining hypotheses is a risk-free way
to reduce the search space. Two hypotheses can be recombined if they agree in
(i) the foreign words covered so far, (ii) the last two English words generated,
and (iii) the end of the last foreign phrase covered.
If there are two paths that lead to two hypotheses that agree in these prop-
erties, we keep only the cheaper hypothesis, i.e., the one with the least cost so
far. The other hypothesis cannot be part of the path to the best translation, and
we can safely discard it. We do keep a record of the additional arc for lattice
generation (see below).
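Continuing the sketch above, recombination can be expressed as keeping, for each key formed from the three listed properties, only the cheapest hypothesis, while remembering the discarded alternatives as extra arcs for lattice generation. Again, this is an illustration rather than the actual implementation.

```python
# Continuing the sketch: recombination keyed on the three listed properties,
# keeping only the cheaper hypothesis and remembering the other as an extra arc.
def recombination_key(hyp):
    return (hyp.coverage, hyp.last_two, hyp.last_end)

def recombine(hypotheses, arcs):
    best = {}
    for hyp in hypotheses:
        key = recombination_key(hyp)
        if key in best and best[key].cost <= hyp.cost:
            arcs.append(hyp)              # discarded, but kept as a lattice arc
        else:
            if key in best:
                arcs.append(best[key])
            best[key] = hyp
    return list(best.values())
```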
Beam Search — While the recombination of hypotheses as described above
reduces the size of the search space, this is not enough for all but the shortest
sentences. Let us estimate how many hypotheses (or, states) are generated during an exhaustive search. Considering the possible values for the properties of unique hypotheses, we can estimate an upper bound for the number of states by N ≈ 2^{nf} × |Ve|^2 × nf, where nf is the number of foreign words and |Ve| the size of the English vocabulary. In practice, the number of possible English words for the last two words generated is much smaller than |Ve|^2. The main concern is the exponential explosion from the 2^{nf} possible configurations of foreign words covered by a hypothesis. Note this causes the problem of machine translation
decoding to be NP-complete [Knight, 1999] and thus dramatically harder than,
for instance, speech recognition.
In our beam search we compare the hypotheses that cover the same number of
foreign words and prune out the inferior hypotheses. We could base the judgment
of what inferior hypotheses are on the cost of each hypothesis so far. However,
this is generally a very bad criterion, since it biases the search to first translating
the easy part of the sentence. For instance, if there is a three-word foreign
phrase that easily translates into a common English phrase, this may carry much
less cost than translating three words separately into uncommon English words.
The search will prefer to start the sentence with the easy part and discount
alternatives too early.
So, our measure for pruning out hypotheses in our beam search includes not only
include the cost so far, but also an estimate of the future cost. This future cost
estimation should favor hypotheses that already covered difficult parts of the
sentence and have only easy parts left, and discount hypotheses that covered the
easy parts first. For details of our future cost estimation, see [Koehn, 2003].
Given the cost so far and the future cost estimation, we can prune out hy-
potheses that fall outside the beam. The beam size can be defined by threshold
and histogram pruning. A relative threshold cuts out a hypothesis with a probability less than a given factor of the best hypothesis; histogram pruning keeps at most a fixed number of hypotheses per stack.
Note that this type of pruning is not risk-free (as opposed to the recombination). If the future cost estimates are inadequate, we may prune out hypotheses on the path to the best translation.
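A sketch of the combined pruning follows, building on the hypothetical Hypothesis objects above; future_cost stands in for the future cost estimation of Koehn [2003], and the default threshold and stack-size values are placeholders, not the settings used in the paper.

```python
# Sketch of pruning a stack of hypotheses that cover the same number of foreign
# words. future_cost stands in for the estimate of Koehn [2003]; the default
# threshold and stack size are placeholders, not the paper's settings.
import math

def prune(stack, future_cost, threshold=0.001, max_size=100):
    if not stack:
        return stack
    scored = sorted(stack, key=lambda h: h.cost + future_cost(h))
    best = scored[0].cost + future_cost(scored[0])
    # Relative threshold: with costs as negative log probabilities, a hypothesis
    # whose probability is less than `threshold` times the best one has a cost
    # exceeding best - log(threshold).
    kept = [h for h in scored if h.cost + future_cost(h) <= best - math.log(threshold)]
    # Histogram pruning: keep at most a fixed number of hypotheses.
    return kept[:max_size]
```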
Fig. 5. Hypothesis expansion: Hypotheses are placed in stacks according to the number
of foreign words translated so far. If a hypothesis is expanded into new hypotheses,
these are placed in new stacks.
The hypotheses explored during search form a graph. Paths branch out when there are multiple translation options for a
hypothesis from which multiple new hypotheses can be derived. Paths join when
hypotheses are recombined.
The graph of the hypothesis space (see Figure 3) can be viewed as a prob-
abilistic finite state automaton. The hypotheses are states, and the records of
back-links and the additionally stored arcs are state transitions. The added prob-
ability scores when expanding a hypothesis are the costs of the state transitions.
Finding the n-best paths in such a probabilistic finite state automaton is a well-studied problem. In our implementation, we store the information about hypotheses, hypothesis transitions, and additional arcs in a file that can be processed by the finite state toolkit Carmel (available at http://www.isi.edu/licensed-sw/carmel/), which we use to generate the n-best lists. This toolkit uses the k-shortest-paths algorithm by Eppstein [1994].
Our method is related to work by Ueffing et al. [2002] for generating n-best lists
for IBM Model 4.
3 XML-Markup
While statistical machine translation methods cope well with many aspects of language translation, there are a few special problems for which better so-
lutions exist. One example is the translation of named entities, such as proper
names, dates, quantities, and numbers.
Consider the task of translating numbers, such as 17.55. In order for a sta-
tistical machine translation system to be able to translate this number, it has to have observed it in the training data. But even if it has been seen a few times, it is
possible that the translation table learned for this “word” is very noisy.
Translating numbers is not a hard problem. It is therefore desirable to be able to tell the decoder up front how to translate such numbers. To continue our example, the German input is marked up so that the word 17,55 carries its desired English translation 17.55 (a sketch of such markup is given below).
This modified input passes to the decoder not only the German words, but
also that the third word, the number 17,55, should be translated as 17.55.
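As a purely illustrative sketch – the span element, the english attribute, and the German example sentence are invented for illustration and are not Pharaoh's documented markup syntax – marked spans can be turned into translation options with probability 1 before the normal collection of options, which is also the idea of the next paragraph:

```python
# Purely illustrative: the <span english="..."> element, its attribute, and the
# German example sentence are invented for this sketch and are not Pharaoh's
# documented markup syntax. Marked spans become translation options with
# probability 1; everything else is left to the normal phrase table.
import re

MARKED = re.compile(r'<span english="([^"]*)">([^<]*)</span>')

def extract_forced_options(marked_input):
    words, options, position = [], [], 0
    for piece in re.split(r'(<span[^>]*>[^<]*</span>)', marked_input):
        match = MARKED.fullmatch(piece.strip())
        if match:
            english, foreign = match.group(1), match.group(2)
            tokens = foreign.split()
            options.append((position, position + len(tokens), english, 1.0))
            words.extend(tokens)
            position += len(tokens)
        else:
            tokens = piece.split()
            words.extend(tokens)
            position += len(tokens)
    return words, options

print(extract_forced_options('er erzielte <span english="17.55">17,55</span> punkte'))
# -> (['er', 'erzielte', '17,55', 'punkte'], [(2, 3, '17.55', 1.0)])
```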
Phrase-Based Translation with XML Markup — Marking a sequence
of words and specifying a translation for them fits neatly into the framework
of phrase-based translation. In a way, for a given phrase in the sentence, a
translation is provided, which is in essence a phrase translation with translation
probability 1. Only for the other parts of the sentence, translation options are
generated from the phrase translation table.
Since the first step of the implementation of the beam search decoder is
to collect all possible translation options, only this step has to be altered to
be sensitive to specifications via XML markup. The core algorithm remains
unchanged.
Passing a Probability Distribution of Translations — Making the
hard decision of specifying the translation for parts of the sentence has some
draw-backs. For instance, the number 1 may be translated as 1, one, a, and so
on into English. So we may want to provide a set of possible translations.
Instead of passing one best translation choice to the decoder, we may want
to pass along a probability distribution over a set of possible translations. Given
several choices, the decoder is aided by the language model to sort out which
translation to use. We extend the XML markup scheme by allowing the specifica-
tion of multiple English translation options along with translation probabilities.
In such markup, both a small house and a little house can, for example, be passed along as possible translations, with translation probabilities 0.6 and 0.4, respectively.
The scores that are passed along with the translations do not have to be
probabilities in a strict sense, meaning, they do not have to add up to 1. They
also do not include language model probabilities, but only the phrase translation
probability for the marked part of the sentence.
Multi-Path Integration — By specifying a set of possible translations, we can deal with the uncertainty about which of the translations is the right one in a given context. But there is also uncertainty about when to use the specified translations at all. Maybe the original model has a better way to deal with the targeted words.
Recognizing that the hard decision of breaking out certain words in the input sentence and providing translations for them may occasionally be harmful,
we now want to relax this decision. We allow the decoder to use the specified
translations, but also to bypass them and use its own translations.
We call this multi-path integration, since we allow two pathways when trans-
lating. The path may go through the specified translations, or through transla-
tion options from the regular translation model.
4 Experiments
In this section, we will report on a few experiments that illustrate the speed and
accuracy of the decoder at different beam sizes.
By limiting the number of translation table entries that are used for each foreign phrase, the number of generated hypotheses can be reduced – not just the number of hypotheses that are entered into the stack.
Table 3 shows the linear increase in speed with respect to the translation table
limit. If we limit the number of translation options to, say, 50, we can drop
the translation time per sentence to under a second, with the same 1% relative
search error.
In conclusion, using the beam threshold 0.1, maximum stack size 1000, and
translation table limit 50, we can dramatically increase the speed of the decoder
– by a factor of 1000 with respect to the original setting – with only limited cost
in terms of search errors.
5 Conclusions
We described Pharaoh, a beam search decoder for phrase-based statistical ma-
chine translation models. Our experiments show that the decoder can achieve
very fast translation performance (one sentence per second).
The decoder also allows the generation of word lattices and n-best lists, which
enable the exploration of reranking methods. An XML interface allows the inte-
gration of external knowledge, e.g., a dedicated name translation module.
We hope that the availability of the decoder fosters research in training phrase
translation models and the integration of machine translation into wider natural
language applications.
Acknowledgments — The development of the decoder was aided by many
helpful hints by Franz Och. The XML markup owes much to Ulrich Germann,
who developed a similar scheme for his greedy decoder for IBM Model 4 [Ger-
mann, 2003].
References
Eppstein, D. (1994). Finding the k shortest paths. In Proc. 35th Symp. Foun-
dations of Computer Science, pages 154–165. IEEE.
Germann, U. (2003). Greedy decoding for statistical machine translation in
almost linear time. In Proceedings of HLT-NAACL.
Jelinek, F. (1998). Statistical Methods for Speech Recognition. The MIT Press.
Knight, K. (1999). Decoding complexity in word-replacement translation models.
Computational Linguistics, 25(4):607–615.
Koehn, P. (2003). Noun Phrase Translation. PhD thesis, University of Southern
California, Los Angeles.
Koehn, P. and Knight, K. (2003). Feature-rich translation of noun phrases. In
41st Annual Meeting of the Association of Computational Linguistics (ACL).
Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation.
In Proceedings of HLT-NAACL.
Kumar, S. and Byrne, W. (2004). Minimum Bayes-risk decoding for statistical
machine translation. In Proceedings of HLT-NAACL.
Marcu, D. and Wong, D. (2002). A phrase-based, joint probability model for
statistical machine translation. In Proceedings of EMNLP, pages 133–139.
Och, F. J. (2002). Statistical Machine Translation: From Single-Word Models to
Alignment Templates. PhD thesis, RWTH Aachen, Germany.
Och, F. J., Gildea, D., Sarkar, A., Khudanpur, S., Yamada, K., Fraser, A., Shen,
L., Kumar, S., Smith, D., Jain, V., Eng, K., Jin, Z., and Radev, D. (2003). Syn-
tax for machine translation. Technical report, Johns Hopkins University Sum-
mer Workshop http://www.clsp.jhu.edu/ws2003/groups/translate/.
Och, F. J. and Ney, H. (2002). Discriminative training and maximum entropy
models for statistical machine translation. In Proceedings of ACL.
Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical
alignment models. Computational Linguistics, 29(1):19–52.
Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach.
Prentice Hall, New Jersey.
Tillmann, C. (2001). Word Re-Ordering and Dynamic Programming based Search
Algorithm for Statistical Machine Translation. PhD thesis, RWTH Aachen,
Germany.
Tillmann, C. (2003). A projection extension algorithm for statistical machine
translation. In Proceedings of EMNLP, pages 1–8.
Ueffing, N., Och, F. J., and Ney, H. (2002). Generation of word graphs in
statistical machine translation. In Proceedings of EMNLP, pages 156–163,
Philadelphia. Association for Computational Linguistics.
Vogel, S., Zhang, Y., Huang, F., Tribble, A., Venugopal, A., Zhao, B., and
Waibel, A. (2003). The CMU statistical machine translation system. In Pro-
ceedings of the Ninth Machine Translation Summit.
Zens, R., Och, F. J., and Ney, H. (2002). Phrase-based statistical machine
translation. In Proceedings of the German Conference on Artificial Intelligence
(KI 2002).
The PARS Family of Machine Translation Systems for
Dutch System Description/Demonstration
The dictionary editor is user friendly: in particular, it lets the lexicographer have a
Dutch verb semi-automatically tagged. For tagging the Dutch words entered into the
dictionaries in the PARS/H MT system, we use
several tools, one of them being the automatic verb encoding mechanism. It is based
on the Dutch morphology description developed in the framework of the PARS
project, and recognizes the Dutch verb conjugation paradigm and alteration(s).
For example, after the Dutch verb aandringen has been entered into the dictionary,
and the POS = Verb selected, the dictionary editor prompts that aan is a separable
prefix. The dictionary officer confirms this and selects Irregular for the Conjugation
type. After that, the dictionary editor automatically assigns the conjugation paradigm
to the verb including the alterations. The paradigm is not displayed for the dictionary
officer but is used by the program for morphological analysis and synthesis.
Fig. 1.
When tagging Dutch nouns, the lexicographer selects the relevant values for the
corresponding noun features, such as Gender, endings, and alteration type.
It is important to note that in creating a dictionary one in fact builds two of them: an English-to-Dutch one as well as a Dutch-to-English one. This is achieved by activating
the so-called Mirror option in the Dictionary Editor. When this option is ON, both
dictionary parts are created, but, when it is OFF, the reverse correspondence is not
entered into the dictionary.
Another option lets the lexicographer transpose the translations in the dictionary
entry to place the most “relevant” one at the top. By doing so, the lexicographer
predicts which of the translations should be used by the translation program and put
into the target text, while the others will be provided as translation variants – a
characteristic feature of the PARS MT systems.
Fig. 2.
2 Grammar Rules
target Dutch text. Besides, a special set of rules is aimed at decomposing the Dutch
composite nouns.
This Dutch analysis and synthesis apparatus can and will serve as the foundation of a family of Dutch MT systems, including two that are currently under way, and others.
In addition to the morphological analyzer for translating to and from the Dutch
language, two grammar books are used to specify and test the PARS
MT program [3,4].
Those grammar books have 1250 rules altogether, covering various issues ranging from spelling to irregular verbs in Dutch, and from the Present Indefinite to phrasal verbs. We are developing a translation program that translates between English and Dutch according to those rules. Implementing the rules makes it possible to obtain what is usually called a "draft translation".
3 Functionality
PARS/H translates between Dutch and English in the following modes: from Clipboard to Clipboard, drag-and-drop, and from file to file. In addition, it is fully integrated with MS Word and Internet Explorer, preserving the source text and source web page formatting. Fig. 3, for example, shows a web page automatically translated from English into Dutch.
Fig. 3.
References
1. Michael S. Blekhman. Slavic Morphology and Machine Translation. Multilingual. Volume
14, Issue 4, 2003, 28-31.
2. Michael Blekhman et al. A New Family of the PARS Translation Systems. In: Machine
Translation: From Research to Real Users, 5th Conference of the Association for Machine
Translation in the Americas, AMTA 2002 Tiburon, CA, USA, October 6-12, 2002,
Proceedings. Lecture Notes in Computer Science 2499 Springer 2002, 232-236.
3. Nederlandse grammatica voor anderstaligen. (Dutch grammar for non-native speakers).
ISBN 90 5517 014 3.
4. English Grammar in Use. Cambridge University Press. ISBN 0 521 43680 X.
Rapid MT Experience in an LCTL (Pashto)
Craig Kopris
Senior Linguist
Applications Technology, Inc.
1 Introduction
A year ago we were faced with a challenge: rapidly develop a machine translation
(MT) system for written Pashto with limited resources. We had three full-time native
speakers (one with a Ph.D. in general linguistics, and translation experience) and one
part-time descriptive linguist with a typological-functional background. In addition,
we had a legacy MT software system, which neither the speakers nor the linguist was
familiar with, although we had the opportunity to occasionally confer with
experienced system users. There were also dated published grammars of varying
(usually inadequate) quality available.
Here we will describe some of our experiences in developing a grammatical
analysis, corpus, and lexicon in an implementable fashion, despite the handicaps.
1. Technically in RRG, intransitives also have an Actor or Undergoer. However, as there is only one argument available, the choice, or rather lack thereof, is irrelevant.
3 Corpus
Our corpus started off with sentences elicited to answer typical fieldwork questions
about the general nature of the language, such as word order, alignment, and clause
combining techniques. These were extended at first with a set of artificial sentences
showing interesting phenomena, starting relatively simple and building towards
greater complexity. After the first few hundred such test sentences, we moved on to
genuine texts, especially covering newspaper and journal articles.
Sentences from all sources were analyzed in a format inherited from previous work
on other languages. Excel spreadsheets were set up containing each example sentence
in a column, followed by each word of that sentence in rows in the next column.
Additional columns were used for coding morphosyntactic and semantic features of
each word, such as case, number, gender, tense, and aspect. These could thus be later
used as additions to the lexicon, if not already present. Perl scripts were also
developed to take raw original texts and reformat them into separate sentences and
words so that only the morphosyntactic and semantic tagging needed to be added.
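A rough sketch of this kind of reformatting step follows. The original scripts were written in Perl; the file names, the naive sentence splitter, and the CSV layout below are assumptions for illustration, with the tagging columns left to be filled in.

```python
# Hedged sketch only: the originals were Perl scripts; file names, the naive
# sentence splitter, and the output layout are assumptions.
import csv
import re

def split_sentences(text):
    # Naive splitter on sentence-final punctuation; real Pashto text would need
    # script-specific rules (e.g. the Arabic-script full stop).
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def corpus_rows(text):
    rows = []
    for sentence in split_sentences(text):
        rows.append([sentence, ""])       # the sentence in the first column
        for word in sentence.split():
            rows.append(["", word])       # each word of the sentence in rows
    return rows

with open("raw_text.txt", encoding="utf-8") as source, \
        open("corpus_rows.csv", "w", newline="", encoding="utf-8") as target:
    csv.writer(target).writerows(corpus_rows(source.read()))
```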
4 Lexicon
Lexicon development began with compilation by native speakers of various public
domain dictionaries, further amplified by lexical items from the corpus not already
part of the dictionaries. The lexicon was prepared in an Excel spreadsheet format,
with each entry on a separate row. Homographs were each given their own entry.
Columns were used to indicate various morphosyntactic and semantic features, as well as the translations, similar to the feature coding of individual words in the
corpus. Pashto speakers then went through the lexicon, correcting entries. For
instance, in many cases dictionaries gave obsolete terms, or forms limited to specific
dialects. Some information, such as conjugation class, was not part of the original
lexical sources (whether dictionary or corpus), and had to be added. Translations were
checked with English speakers.
Excel macros which had previously been developed for the Arabic MT system
were then used to convert the lexicon into the specific formalism used by the MT
software.
1 Introduction
Automatic metrics for machine translation (MT) evaluation have been receiv-
ing significant attention in the past two years, since IBM’s BLEU metric was
proposed and made available [1]. BLEU and the closely related NIST metric [2]
have been extensively used for comparative evaluation of the various MT sys-
tems developed under the DARPA TIDES research program, as well as by other
MT researchers. Several other automatic metrics for MT evaluation have been
proposed since the early 1990s. These include various formulations of measures
of “edit distance” between an MT-produced output and a reference translation
[3] [4], and similar measures such as “word error rate” and “position-independent
word error rate” [5], [6].
The utility and attractiveness of automatic metrics for MT evaluation has
been widely recognized by the MT community. Evaluating an MT system using
such automatic metrics is much faster, easier and cheaper compared to human
evaluations, which require trained bilingual evaluators. In addition to their utility
for comparing the performance of different systems on a common translation
task, automatic metrics can be applied on a frequent and ongoing basis during
system development, in order to guide the development of the system based on
concrete performance improvements.
In this paper, we present a comparison between the widely used BLEU and
NIST metrics, and a set of easily computable metrics based on unigram precision
and recall. Using several empirical evaluation methods that have been proposed
in the recent literature as concrete means to assess the level of correlation of au-
tomatic metrics and human judgments, we show that higher correlations can be
obtained with fairly simple and straightforward metrics. While recent researchers
[7] [8] have shown that a balanced combination of precision and recall (F1 mea-
sure) has improved correlation with human judgments compared to BLEU and
NIST, we claim that even better correlations can be obtained by assigning more
weight to recall than to precision. In fact, our experiments show that the best
correlations are achieved when recall is assigned almost all the weight. Previous
work by Lin and Hovy [9] has shown that a recall-based automatic metric for
evaluating summaries outperforms the BLEU metric on that task. Our results
show that this is also the case for evaluation of MT. We also demonstrate that
stemming both MT-output and reference strings prior to their comparison, which
allows different morphological variants of a word to be considered as “matches”,
significantly further improves the performance of the metrics.
We describe the metrics used in our evaluation in Section 2. We also discuss
certain characteristics of the BLEU and NIST metrics that may account for the
advantage of metrics based on unigram recall. Our evaluation methodology and
the data used for our experimentation are described in section 3. Our experiments
and their results are described in section 4. Future directions and extensions of
this work are discussed in section 5.
2 Evaluation Metrics
The metrics used in our evaluations, in addition to BLEU and NIST, are based on
explicit word-to-word matches between the translation being evaluated and each
of one or more reference translations. If more than a single reference translation
is available, the translation is matched with each reference independently, and
the best-scoring match is selected. While this does not allow us to simultaneously
match different portions of the translation with different references, it supports
the use of recall as a component in scoring each possible match. For each metric,
including BLEU and NIST, we examine the case where matching requires that
the matched word in the translation and reference be identical (the standard
behavior of BLEU and NIST), and the case where stemming is applied to both
strings prior to the matching1. In the second case, we stem both translation
and references prior to matching and then require identity on stems. We plan
to experiment in the future with less strict matching schemes that will consider
matching synonymous words (with some cost), as described in section 5.
BLEU is based on the overlap of n-grams between the translation being evaluated and a set of one or more reference translations. The
main component of BLEU is n-gram precision: the proportion of the matched n-
grams out of the total number of n-grams in the evaluated translation. Precision
is calculated separately for each n-gram order, and the precisions are combined
via a geometric averaging. BLEU does not take recall into account directly.
Recall – the proportion of the matched n-grams out of the total number of n-grams in the reference translation – is extremely important for assessing the quality of MT output, as it reflects to what degree the translation covers the entire content of the translated sentence. BLEU does not use recall because the notion of recall is unclear when simultaneously matching against multiple reference translations (rather than a single reference). To compensate for the lack of recall, BLEU uses a Brevity Penalty, which penalizes translations for being “too short”.
The NIST metric is conceptually similar to BLEU in most aspects, including the
weaknesses discussed below:
The Lack of Recall: We believe that the brevity penalty in BLEU does
not adequately compensate for the lack of recall. Our experimental results
strongly support this claim.
Lack of Explicit Word-matching Between Translation and Refer-
ence: N-gram counts don’t require an explicit word-to-word matching, but
this can result in counting incorrect “matches”, particularly for common
function words. A more advanced metric that we are currently developing
(see section 4.3) uses the explicit word-matching to assess the grammatical
coherence of the translation.
Use of Geometric Averaging of N-grams: Geometric averaging results
in a score of “zero” whenever one of the component n-gram scores is zero.
Consequently, BLEU scores at the sentence level can be meaningless. While
BLEU was intended to be used only for aggregate counts over an entire
test-set (and not at the sentence level), a metric that exhibits high levels
of correlation with human judgments at the sentence level would be highly
desirable. In experiments we conducted, a modified version of BLEU that
uses equal-weight arithmetic averaging of n-gram scores was found to have
better correlation with human judgments at both the sentence and system
level.
Precision is computed as P = m / t, where m is the number of words in the translation that match words in the reference translation, and t is the number of words in the translation. This may be interpreted as the fraction of the words in the translation that are present in the reference translation.
6. F1 with Stemming: Same as above, but using the stemmed version of both precision and recall.
7. Fmean: This is similar to F1, but recall is weighted nine times more heavily
than precision. The precise amount by which recall outweighs precision is
less important than the fact that most of the weight is placed on recall. The
balance used here was estimated using a development set of translations and
references (we also report results on a large test set that was not used in any
way to determine any parameters in any of the metrics). Fmean is calculated
as follows:
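With the 9:1 weighting of recall over precision described above, the weighted harmonic mean takes the following form (the constants follow from that description rather than being quoted directly):

Fmean = (10 × P × R) / (R + 9 × P)

A small sketch of the unigram scores used in this section; the tokenization, variable names, and the stemmer hook are assumptions, not the authors' implementation.

```python
# Illustrative only: unigram precision, recall, and the recall-weighted Fmean,
# matching against each reference independently and keeping the best score.
from collections import Counter

def precision_recall(hyp_tokens, ref_tokens):
    # Clipped unigram matches between the translation and one reference.
    matches = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    precision = matches / len(hyp_tokens) if hyp_tokens else 0.0
    recall = matches / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall

def fmean(precision, recall):
    # Harmonic mean with recall weighted nine times more than precision.
    if precision + recall == 0.0:
        return 0.0
    return 10 * precision * recall / (recall + 9 * precision)

def score(translation, references, stem=lambda w: w):
    # `stem` can be replaced by e.g. a Porter stemmer for the stemmed variants.
    hyp = [stem(w) for w in translation.lower().split()]
    best = (0.0, 0.0, 0.0)
    for reference in references:
        ref = [stem(w) for w in reference.lower().split()]
        p, r = precision_recall(hyp, ref)
        best = max(best, (fmean(p, r), p, r))
    return best   # (Fmean, precision, recall) for the best-matching reference

print(score("the cat sat on the mat", ["the cat sat on a mat", "a cat is on the mat"]))
```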
4 Metric Evaluation
We first compare the various metrics in terms of the correlation they have with
total human scores at the system level. For each metric, we plot the metric
and total human scores assigned to each system and calculate the correlation
coefficient between the two scores. Tables 1 and 2 summarize the results for
the various metrics on the 2002 and 2003 data sets. All metrics show much
higher levels of correlation with human judgments on the 2003 data, compared
with the 2002 data. The 2002 data exhibits several anomalies that have been
identified and discussed by several other researchers [13]. Three of the 2002
systems have output that contains significantly higher amounts of “noise” (non-ASCII characters) and upper-cased words, which are detrimental to the automatic
metrics. The variability within the 2002 set is also much higher than within the
2003 set, as reflected by the confidence intervals of the various metrics.
The levels of correlation of the different metrics are quite consistent across
both 2002 and 2003 data sets. Unigram-recall and F-mean have significantly
higher levels of correlation than BLEU and NIST. Unigram-precision, on the
other hand, has a poor level of correlation. The performance of F1 is inferior
to F-mean on the 2002 data. On the 2003 data, F1 is inferior to Fmean, but
stemmed F1 is about equivalent to Fmean. Stemming improves correlations for
all metrics on the 2002 data. On the 2003 data, stemming improves correlation
on all metrics except for recall and Fmean, where the correlation coefficients
are already so high that stemming no longer has a statistically significant effect.
Recall, Fmean and NIST also exhibit more stability than the other metrics, as
reflected by the confidence intervals.
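The system-level evaluation can be reproduced in outline with the sketch below, which computes the Pearson correlation between metric scores and total human scores and a percentile-bootstrap confidence interval in the spirit of [12]; the exact resampling protocol behind the reported confidence intervals is not specified here, so the details are illustrative assumptions.

import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def bootstrap_interval(metric_scores, human_scores, trials=1000, alpha=0.05):
    # Resample systems with replacement and take percentiles of the resulting correlations.
    n = len(metric_scores)
    correlations = []
    for _ in range(trials):
        idx = [random.randrange(n) for _ in range(n)]
        correlations.append(pearson([metric_scores[i] for i in idx],
                                    [human_scores[i] for i in idx]))
    correlations.sort()
    low = correlations[int(alpha / 2 * trials)]
    high = correlations[int((1 - alpha / 2) * trials) - 1]
    return low, high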
We next calculated the score differentials for each pair of systems that were
evaluated and assessed the correlation between the automatic score differentials
and the human score differentials. The results of this evaluation are summarized
in Tables 3 and 4. The results of the system pair differential correlation experi-
ments are very consistent with the system-level correlation results. Once again,
Unigram-recall and Fmean have significantly higher levels of correlation than
BLEU and NIST. The effects of stemming are somewhat less pronounced in this
evaluation.
4.3 Discussion
It is clear from these results that unigram-recall has a very strong correlation
with human assessment of MT quality, and stemming often strengthens this
correlation. This follows the intuitive notion that MT system output should contain as much of the meaning
of the input as possible. It is perhaps surprising that unigram-precision, on the
other hand, has such low correlation. It is still important, however, to factor
precision into the final score assigned to a system, to prevent systems that output
very long translations from receiving inflated scores (as an extreme example, a
system that outputs every word in its vocabulary for every translation would
consistently score very high in unigram recall, regardless of the quality of the
translation). Our Fmean metric is effective in combining precision and recall.
Because recall is weighted heavily, the Fmean scores have high correlations. For
both data sets tested, recall and Fmean performed equally well (differences were
statistically insignificant), even though precision performs much worse. Because
we use a weighted harmonic mean, where precision and recall are multiplied, low
levels of precision properly penalize the Fmean score (thus disallowing the case
of a system scoring high simply by outputting many words).
One feature of BLEU and NIST that is not included in simple unigram-
based metrics is the approximate notion of word order or grammatical coherence
achieved by the use of higher-level n-grams. We have begun development of a new
metric that combines the Fmean score with an explicit measure of grammatical
coherence. This metric, METEOR (Metric for Evaluation of Translation with
Explicit word Ordering), performs a maximal-cardinality match between trans-
lations and references, and uses the match to compute a coherence-based penalty.
This computation is done by assessing the extent to which the matched words
between translation and reference constitute well ordered coherent “chunks”.
Preliminary experiments with METEOR have yielded promising results, achieving levels of correlation similar to (though so far not statistically significantly higher than) those of the simpler Fmean and recall measures.
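The chunk-based idea can be sketched as follows; the penalty formula here is purely illustrative (METEOR's actual formulation is not given in this paper), but it shows how a maximal word matching can be turned into a measure of fragmentation.

def chunk_count(match_pairs):
    # match_pairs: (candidate_index, reference_index) pairs for matched words, sorted by
    # candidate position. A new chunk starts whenever the matched words stop being
    # contiguous and in the same order in both sentences.
    if not match_pairs:
        return 0
    chunks = 1
    for (c_prev, r_prev), (c_cur, r_cur) in zip(match_pairs, match_pairs[1:]):
        if c_cur != c_prev + 1 or r_cur != r_prev + 1:
            chunks += 1
    return chunks

def fragmentation(match_pairs):
    # Illustrative coherence penalty: the fewer and longer the chunks, the smaller the value.
    matches = len(match_pairs)
    return chunk_count(match_pairs) / matches if matches else 1.0

A translation whose matched words form one long, correctly ordered chunk receives the minimal penalty, while a translation whose matches are scattered is penalised more heavily.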
Combining Precision, Recall and Sort Penalty: Results so far indicate that recall
plays the most important role in obtaining high levels of correlation with human
judgments. We are currently exploring alternative ways for combining the com-
ponents of precision, recall and a coherence penalty with the goal of optimizing
correlation with human judgments, and exploring whether an optimized combi-
nation of these factors on one data set is also persistent in performance across
different data sets.
The Utility of Multiple Reference Translations: The metrics described use mul-
tiple reference translations in a weak way: we compare the translation with each
reference separately and select the reference with the best match. This was nec-
essary in order to incorporate recall in our metric, which we have shown to be
highly advantageous. We are in the process of quantifying the utility of multiple
reference translations across the metrics by measuring the correlation improve-
ments as a function of the number of reference translations. We will then consider
exploring ways in which to improve our matching against multiple references.
Recent work by Pang, Knight and Marcu [14] provides the mechanism for pro-
ducing semantically meaningful additional “synthetic” references from a small
set of real references. We plan to explore whether using such synthetic references
can improve the performance of our metric.
Matched Words Are Not Created Equal: Our current metrics treat all matched words between a system translation and a reference equally. It is safe to assume,
however, that matching semantically important words should carry significantly
more weight than the matching of function words. We plan to explore schemes
for assigning different weights to matched words, and investigate if such schemes
can further improve the sensitivity of the metric and its correlation with human
judgments of MT quality.
References
1. Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a
Method for Automatic Evaluation of Machine Translation. In Proceedings of 40th
Annual Meeting of the Association for Computational Linguistics (ACL), pages
311–318, Philadelphia, PA, July.
2. Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality
Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference
on Human Language Technology (HLT-2002). San Diego, CA. pp. 128–132.
3. K.-Y. Su, M.-W. Wu, and J.-S. Chang. 1992. A New Quantitative Quality Measure
for Machine Translation Systems. In Proceedings of the fifteenth International
Conference on Computational Linguistics (COLING-92). Nantes, France. pp. 433–
439.
4. Y. Akiba, K. Imamura, and E. Sumita. 2001. Using Multiple Edit Distances to
Automatically Rank Machine Translation Output. In Proceedings of MT Summit
VIII. Santiago de Compostela, Spain. pp. 15–20.
5. S. Niessen, F. J. Och, G. Leusch, and H. Ney. 2000. An Evaluation Tool for Machine
Translation: Fast Evaluation for Machine Translation Research. In Proceedings
of the Second International Conference on Language Resources and Evaluation
(LREC-2000). Athens, Greece. pp. 39–45.
6. Gregor Leusch, Nicola Ueffing and Hermann Ney. 2003. String-to-String Distance
Measure with Applications to Machine Translation Evaluation. In Proceedings of
MT Summit IX. New Orleans, LA. Sept. 2003. pp. 240–247.
7. I. Dan Melamed, R. Green and J. Turian. 2003. Precision and Recall of Machine
Translation. In Proceedings of HLT-NAACL 2003. Edmonton, Canada. May 2003.
Short Papers: pp. 61–63.
8. Joseph P. Turian, Luke Shen and I. Dan Melamed. 2003. Evaluation of Machine
Translation and its Evaluation. In Proceedings of MT Summit IX. New Orleans,
LA. Sept. 2003. pp. 386–393.
9. Chin-Yew Lin and Eduard Hovy. 2003. Automatic Evaluation of Summaries Using
N-gram Co-occurrence Statistics. In Proceedings of HLT-NAACL 2003. Edmonton,
Canada. May 2003. pp. 71–78.
10. C. van Rijsbergen. 1979. Information Retrieval. Butterworths. London, England.
2nd Edition.
11. Deborah Coughlin. 2003. Correlating Automated and Human Assessments of
Machine Translation Quality. In Proceedings of MT Summit IX. New Orleans,
LA. Sept. 2003. pp. 63–70.
12. Bradley Efron and Robert Tibshirani. 1986. Bootstrap Methods for Standard Er-
rors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical
Science, 1(1). pp. 54–77.
13. George Doddington. 2003. Automatic Evaluation of Language Translation us-
ing N-gram Co-occurrence Statistics. Presentation at DARPA/TIDES 2003 MT
Workshop. NIST, Gaithersburg, MD. July 2003.
14. Bo Pang, Kevin Knight and Daniel Marcu. 2003. Syntax-based Alignment of
Multiple Translations: Extracting Paraphrases and Generating New Sentences. In
Proceedings of HLT-NAACL 2003. Edmonton, Canada. May 2003. pp. 102–109.
Alignment of Bilingual Named Entities in Parallel
Corpora Using Statistical Model
1 Introduction
Named Entities (NEs) make up a substantial portion of text documents. Extracting and translating NEs is vital for research in natural language processing-related areas, including
machine translation, cross-language information retrieval, and bilingual lexicon con-
struction. At the 7th Message Understanding Conference (MUC-7), the NE task [6]
consisted of three subtasks: entity names (organizations (ORG), persons (PER), and
locations (LOC)), temporal expressions (dates, times), and number expressions
(monetary values, percentages). We will focus on extracting the first type of NE
phrases, i.e. entity names.
Although many investigators have reported on NE identification within mono-
lingual documents, the feasibility of extracting interlingual NEs has seldom been
addressed owing to the complexity of the task. Al-Onaizan and Knight [1] proposed
an algorithm to translate NEs from Arabic to English using monolingual and bilingual
resources. Huang and Vogel [8] proposed an iterative approach to extract English-
Chinese NE pairs from bilingual corpora using a statistical NE alignment model.
Chen et al. [5] investigated formulation and transformation rules for English-Chinese
NEs. Moore [13] proposed an approach to choose English-French NE pairs in parallel
corpora using multiple progressively refined models. Kumano et al. [9] presented a
method for acquiring English-Japanese NE pairs from parallel corpora. In our previ-
ous work, Lee et al. [11] proposed an approach using a phrase translation model and a
transliteration model to align English-Chinese NE pairs in parallel corpora. In this
paper, we improve on that previous work by integrating statistical models with ap-
proximate matching and additional knowledge sources to align English-Chinese NE
pairs in parallel corpora. Experimental results show greatly improved performance.
2 Baseline Method
For the sake of comparison, we first describe a baseline method that directly utilizes a
phrase translation model and a transliteration model to align bilingual NE pairs in
parallel corpora. Then, we describe a modified phrase translation model and a scoring
formula to determine the alignment scores of NE pairs.
In our previous work, Chang et al. [4] proposed a statistical translation model to per-
form noun phrase translation and Lee and Chang [10] proposed a statistical translit-
eration model to conduct machine transliteration. We briefly describe the above mod-
els in this subsection.
Based on the above model, finding the best translation f* for a given e is the following:

f* = argmax_f P(f | e) = argmax_f Σ_a P(f, a | e)

For simplicity, the best alignment with the highest probability is chosen to decide the most probable translation, instead of summing over all possible alignments a. Thus, we have:

f* = argmax_f max_a P(f, a | e)
For example, consider the case where the source phrase is E = “Ichthyosis Concern Association” together with its Chinese translation phrase. Given the correct alignment, the phrase translation probability is represented as:
Based on the modified formulation of the alignment probability, the best translation for a given e is obtained by Eq. (5):
The integrated score function for the target phrase f, given e, is defined as follows, by regarding the score function as a log probability function.
statistics for both languages. The log-likelihood ratio statistic is computed for two consecutive words in both languages. Finally, we perform content word alignment based on the Competitive Linking Algorithm [12].
To avoid introducing too much noise, only bilingual phrases with high probabilities are considered. The integrated lexical translation probability is estimated using a linear interpolation strategy:
From the above definitions and independence assumptions, Eq. (10) is rewritten as:
However, to reduce the amount of computation, the process of finding the most probable transliteration for a given E can be approximated as:
Then, by regarding the score function as a log probability function, the transliteration score function for F, given E, is defined as:
Step 2 (Recursion):
Step 3 (Termination):
where the two parameters are penalty scores for an insertion operation and a deletion operation at the word level.
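One possible reading of the recursion sketched above is a word-level alignment score in which each English/Chinese word pair is scored by the translation or transliteration model and unmatched words incur fixed insertion and deletion penalties; the following is an illustrative reconstruction, not the authors' exact formulation (the pair_score function and the penalty values are assumptions).

def alignment_score(e_words, f_words, pair_score, ins_penalty=-1.0, del_penalty=-1.0):
    # score[i][j]: best score for aligning the first i English words with the first j Chinese words.
    n, m = len(e_words), len(f_words)
    score = [[float('-inf')] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # align e_i with f_j, scored by the translation/transliteration model
                score[i][j] = max(score[i][j],
                                  score[i - 1][j - 1] + pair_score(e_words[i - 1], f_words[j - 1]))
            if i > 0:             # deletion: an English word is left unaligned
                score[i][j] = max(score[i][j], score[i - 1][j] + del_penalty)
            if j > 0:             # insertion: a Chinese word is left unaligned
                score[i][j] = max(score[i][j], score[i][j - 1] + ins_penalty)
    return score[n][m]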
Suppose that there is an entry with probability p (or an equivalent score) in the bilingual dictionary; the corresponding dictionary score is formulated as:
3 Improvement
In this section, we incorporate multiple knowledge sources to improve on Phrase
Translation Model and Transliteration Model, including approximate matching and
Chinese personal name recognition.
where the three terms stand for the two-character given name, the first character of the two-character given name, and the second character of the two-character given name, respectively, and the remaining parameters are constants.
The decision function for the single-character given name is defined as follows:
5 Experiments
Several corpora were collected to estimate the parameters of the proposed models.
Noun phrases of the BDC Electronic Chinese-English Dictionary [2] were used to
train PTM. The LDC Central News Agency Corpus was used to extract keywords of
entity names for identifying NE types. We collected 117 bilingual keyword pairs
from the corpora. In addition, a list of Chinese surnames was also gathered to help to
identify and extract the PER-type of NEs. To train the transliteration model, 2,430
pairs of English names together with their Chinese transliterations and Chinese Ro-
manization were used. The parallel corpus collected from the Sinorama Magazine
was used to construct the corpus-based lexicon and estimation of LTP. Test cases
were also drawn from the Sinorama Magazine to evaluate the performance of bilin-
gual NE alignment.
Performance on the alignment of NEs is evaluated according to the precision rate
at the NE phrase level. To analyze the performance of the proposed
methods for NE alignment, we randomly selected 275 aligned sentences from Sino-
rama and manually prepared the answer keys. Each chosen aligned sentence contains
at least one NE pair. Currently, we restrict the lengths of English NEs to less than six words. In total, 830 pairs of NEs are labeled. The numbers of NE pairs for types PER,
LOC, and ORG are 208, 362, and 260, respectively. Several experiments were conducted to analyze the relative contributions of the corresponding enhancements to performance with respect to the baseline. We add each feature individually to the baseline method and then combine all features with the baseline to examine
their effects on performance. Experimental results are shown in Table 1.
In Table 1, the results reveal that each knowledge feature has different contribu-
tions to different NE types. As shown in Table 1, the performance of the enhanced
method is much better than that of the baseline. Furthermore, experimental results indicate that the LOC type has the best performance and the ORG type the worst. The lower performance on ORG is largely due to its highly complex structure and variation.
6 Conclusions
In this paper, a new method for bilingual NE alignment is proposed. Experiments
show that the baseline method aligns bilingual NEs reasonably well in parallel cor-
pora. Moreover, to improve performance, a unified framework has been investigated that performs the task by incorporating the proposed statistical models with multiple knowl-
edge sources, including approximate matching and Chinese personal name recogni-
tion. Experimental results demonstrate that the unified framework can achieve a very
significant improvement.
References
1. Al-Onaizan, Yaser, Kevin Knight: Translating named entities using monolingual and
bilingual resources. In Proceedings of ACL 40 (2002) 400-408.
2. BDC: The BDC Chinese-English electronic dictionary (version 2.0). Behavior Design
Corporation, Taiwan (1992).
3. Brown, P. F., Della Pietra S. A., Della Pietra V. J., Mercer R. L.: The mathematics of
statistical machine translation: parameter estimation. Computational Linguistics, 19 (2)
(1993) 263-311.
4. Chang, Jason S., David Yu, Chun-Jen Lee: Statistical Translation Model for Phrases.
Computational Linguistics and Chinese Language Processing, 6 (2) (2001) 43-64.
5. Chen, Hsin-Hsi, Changhua Yang, Ying Lin: Learning formulation and transformation
rules for multilingual named entities. In Proceedings of the ACL Workshop on Multilin-
gual and Mixed-language Named Entity Recognition (2003), 1-8.
6. Chinchor, Nancy: MUC-7 Named entity task definition. In Proceedings of the 7th Mes-
sage Understanding Conference (1997).
7. Damerau, F.: A technique for computer detection and correction of spelling errors. Comm.
of the ACM, 7(3), (1964), 171-176.
8. Huang, Fei, Stephan Vogel: Improved Named Entity Translation and Bilingual Named
Entity Extraction. In Proceedings of Int. Conf. on Multimodal Interfaces (2002), 253-260.
9. Kumano, Tadashi, Hideki Kashioka, Hideki Tanaka, Takahiro Fukusima: Acquiring Bi-
lingual Named Entity Translations from Content-aligned Corpora. In Proceedings of the
First International Joint Conference on Natural Language Processing (2004).
10. Lee, Chun-Jen, Jason S. Chang: Acquisition of English-Chinese transliterated word pairs
from parallel-aligned texts using a statistical machine transliteration model. In Proceedings
of HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Ma-
chine Translation and Beyond (2003), 96-103.
11. Lee, Chun-Jen, Jason S. Chang, Jyh-Shing Roger Jang: Bilingual named-entity pairs
extraction from parallel corpora. In Proceedings of IJCNLP-04 Workshop on Named En-
tity Recognition for Natural Language Processing Applications, (2004), 9-16.
12. Melamed, I. Dan: A Word-to-Word Model of Translational Equivalence. In Proceeding of
ACL 35, (1997) 490-497.
13. Moore, Robert C.: Learning Translations of Named-Entity Phrases from Parallel Corpora.
In Proceedings of the 10th Conference of the European Chapter of the Association for
Computational Linguistics (2003) 259-266.
14. Wu, Chien-Cheng, Jason S. Chang: Bilingual Collocation Extraction Based on Syntactic
and Statistical Analyses. In Proceedings of ROCLING XV (2003) 33-55.
Weather Report Translation Using
a Translation Memory
RALI/DIRO
Université de Montréal
C.P. 6128, succursale Centre-ville
H3C 3J7, Montréal, Québec, Canada
http://www.rali-iro.umontreal.ca
1 Introduction
This system has been in continuous use since 1984, translating up to 45,000
words a day. [7] argues that one of the reasons for the success of the MÉTÉO
system is the nature of the problem itself: a specific domain, with very repetitive
texts that are particularly unappealing to translate for a human (see for example
the reports shown in Figure 1). Furthermore, the life of a weather report is,
by nature, very short (approximately 6 hours), which makes such reports ideal candidates for automation.
Professional translators are asked to correct the machine output when the
input English text cannot be parsed, often because of spelling errors in the
original English text. MÉTÉO is one of the very few machine translation systems in the world whose unedited output is used by the public in everyday life without any human revision.
Some alternatives to machine translation (MT) have been proposed for
weather reports, namely multilingual text generation directly from raw weather
data: temperatures, winds, pressures etc. Such generation systems also need
some human template selection for organising the report. Generating text in
many languages from one source is quite appealing from a conceptual point of
view and has been cited as one of the potential applications for natural language
generation [15]; some systems have been developed [9,6,4] and tested in opera-
tional contexts. But thus far, none has been used in everyday production to the
same level as the one attained by MT. One of the reasons for this is that mete-
orologists prefer to write their reports in natural language rather than selecting
text structure templates.
Our goal in this study was to determine how well a simple memory-based
approach would fit in the context of weather report translation. We describe
in section 2 the data we received from Environment Canada and what prepro-
cessing we performed to obtain our MÉTÉO bitext. We present in section 4 the
first prototype developed and then report on the results obtained in section 5.
We analyse in section 6 the main kinds of errors that are produced by this ap-
proach. In section 7, we conclude with a general discussion of this study and
propose some possible extensions.
2 The Corpus
We obtained from Environment Canada forecast reports in both French and
English produced during 2002 and 2003. The current reports are available on
the web at http://meteo.ec.gc.ca/forecast/textforecast_f.html.
We used this corpus to populate a bitext, i.e., an aligned corpus of corre-
sponding sentences in French and English weather reports. Like all work on real
data, this conceptually simple task proved to be more complicated than we had
initially envisioned. This section describes the major steps of this stage.
We received files containing both French and English weather forecasts. Both
the source report, usually in English, and its translation, produced either by
a human or by the current MÉTÉO system, appear in the same file. One file
contains all reports issued for a single day. A report is a fairly short text, on
average 304 words, in a telegraphic style. All letters are capitalised and unaccented, and there is almost never any punctuation except for a terminating period.
As can be seen in the example in Figure 1, there are few determiners such
as articles (a or the in English, le or un in French). A report usually starts with
a code identifying the source which issued the report. For example, in FPCN18
CWUL 312130, 312130 indicates that the report was produced at 21h30 on the
31st day of the month; CWUL is a code corresponding to Montreal and the western
area of Quebec. A report (almost always) ends with a closing markup: END or
FIN depending on the language of the report. If the author or the translator is a
human, his or her initials are added after a slash following the markup.
Our first step is to determine the beginning and end of each weather forecast
using regular expressions to match the first line of a forecast, which identifies the
source which issued it, and the last line which usually starts with END or FIN.
Then we distinguish the English forecasts from the French ones according
to whether they ended with END or FIN. Given the fact that we started with a
fairly large amount of data, we decided to discard any forecast that we could
not identify with this process. We were left with 273 847 reports.
Next, we had to match English and French forecasts that are translations of
each other. As we see in fig. 1, the first line of the two reports is almost the same
except for the first part of the source identifier which is FPCN18 in English and
FPCN78 in French. After studying the data, we determined that this shift of 60
between English and French forecast identifiers seemed valid for identifiers from
FPCN10 through FPCN29. These identifiers being the most frequent, we decided
to keep only these in our final bitext.
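The header and footer detection and the identifier pairing can be sketched as follows; the regular expressions are assumptions about the file format, not the exact patterns used in the original Perl scripts.

import re

HEADER = re.compile(r'^(FPCN\d{2}) (\w{4}) (\d{6})\s*$')   # e.g. "FPCN18 CWUL 312130"
FOOTER = re.compile(r'^(END|FIN)(/\w+)?\s*$')              # optional translator initials after "/"

def french_identifier(english_id):
    # Map an English forecast identifier to its French counterpart, e.g. FPCN18 -> FPCN78.
    # The shift of 60 was only observed to hold for FPCN10 through FPCN29.
    number = int(english_id[4:])
    return 'FPCN%02d' % (number + 60) if 10 <= number <= 29 else None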
This preprocessing stage required about 1 500 lines of Perl code and a few weeks
of monitoring. Of the 561 megabytes of text we originally received, we were left
with only 439 megabytes of text, representing 89 697 weather report pairs.
To get a bitext out of this selected material, we first automatically segmented
the reports into words and sentences using an in-house tool that we did not try
to adapt to the specificity of the weather forecasts.
We then ran the Japa sentence aligner [11] on the corpus (this took around
2 hours running on a standard P4 workstation), to identify 4.4 million pairs of
sentences, from which we removed about 28 000 (roughly 0.6%) which were not
one-to-one sentence pairs.
We divided this bitext into three non-overlapping sections, as reported in
Table 1: TRAIN (January 2002 to October 2003) for populating the translation
memory, BLANC (December 2003) for tuning a few meta-parameters, and TEST
(November 2003) for testing.
The TEST section was deliberately chosen so as to be different from the TRAIN
period in order to recreate as much as possible the working environment of a
system faced with the translation of new weather forecasts.
A quick inspection of the bitext reveals that sentences are fairly short: an av-
erage of 7.2 English and 8.9 French words. Most sentences are repeated: only 8.6% of the English sentences are unique. About 90% of the sentences to be translated can be retrieved verbatim from the memory with at most one edit operation, i.e., the insertion, deletion, or substitution of a word. These properties of our bitext
naturally suggest a memory-based approach for translating weather reports.
On the other hand, it is clear that the larger the memory, the better our chances will
be to find sentences we want to translate (or ones within a short edit distance),
even if these sentences were not frequent in the training corpus.
The percentage of sentences to translate found verbatim in the memory grows
logarithmically with the size of the memory until it reaches approximately 20 000
sentence pairs. With the full memory (about 300 000 source sentences), we obtain
a peak of 87% of the sentences being found in the memory.
The second parameter of the translation memory is the number of French
translations stored for each English sentence. Among the 488 792 different En-
glish sentences found in our training corpus, 437 418 (89.5%) always exhibit the
same translation. This is probably because most of the data we received from
Environment Canada is actually machine translated and has not been edited by
human translators. In practice, we found that considering a maximum of N = 5
translations for a given English sentence was sufficient for our purposes.
The target sentences are then ranked in increasing order of the edit distance between their associated English sentence and the source sentence. Ties in this ranking are broken by preferring translations with larger counts.
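A minimal sketch of this memory lookup, assuming the memory maps each English sentence to its observed French translations with counts (an exhaustive scan rather than an efficient index), might look as follows:

def edit_distance(a, b):
    # Word-level Levenshtein distance (insertion, deletion, substitution).
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        current = [i]
        for j, word_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (word_a != word_b))) # substitution
        previous = current
    return previous[-1]

def translate(source, memory, max_candidates=5):
    # memory: {english_sentence (tuple of words): {french_sentence: count}}.
    # Rank French candidates by the edit distance of their English side to the input,
    # breaking ties by preferring translations seen more often in training.
    scored = []
    for english, translations in memory.items():
        distance = edit_distance(source, english)
        for french, count in translations.items():
            scored.append((distance, -count, french))
    scored.sort()
    return [french for _, _, french in scored[:max_candidates]]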
The translations produced are finally postprocessed in order to transform the
meta-tokens introduced during preprocessing into their appropriate word form.
We observed on a held-out test corpus that this cascade of pre- and postprocess-
ing clearly boosted the coverage of the memory.
Fig. 2. Performance of the engine as a function of the number of pairs of sentences kept in the memory. Each point corresponds to a frequency threshold (from 10 to 1) we considered for filtering the training sentences. These rates are reported for sentences of 10 words or less (1-10) and for 35 words or less (1-35).
5 Results
translations produced in the former case were identical to the reference, while only 15% were in the latter case.
More surprisingly, these figures tell us that our simple approach is a fairly
accurate way to translate MÉTÉO sentences, a WER of 9 being several times lower
than what we usually observe in “classical” translation tasks. However, we are
still below the performance that has been reported in [13]. The author manually
inspected a sample of 1257 translations produced by the MÉTÉO2 system and
determined that around 11% of them required a correction (minor or not). In our
case, and although the SER we measure does not compare directly, we observed on a sample of 36228 translations that around 24% of them do not match the reference verbatim.
We analyse in the following section the major errors produced by our ap-
proach.
6 Error Analysis
We analysed semi-automatically the most frequent errors produced by our pro-
cess for the 25% of the translations that differed from the reference. We (arbi-
trarily) selected one of the alignments with the minimum edit distance between
the reference translation and the erroneous candidate translation. From this
alignment, we identified the loci of errors. This process is illustrated in Figure 3, in which an error is indicated by the notation SOURCE → TARGET.
As the classical edit distance between two sequences behaves poorly in the
case of word reordering, some errors might not have been analyzed correctly
by our process (see the alignment in the last line of Figure 3). However, casual
inspection of the errors did not reveal severe problems.
Out of the 8900 errors found at the word level, we manually inspected the 100
most frequent ones (the first ten are reported in Table 3), covering 42.6% of the
errors. We found that more than 53% of the errors at the word sequence level
were replacement errors, such as the production of the word LEGER (light)
instead of the expected word FAIBLE (light). Around 30% were deletion errors, i.e., portions of text that were not produced but should have been.
Fig. 3. Illustration of the error detection process on four sentences making use of
the edit distance alignment between the reference and the candidate translations. I
indicates an insertion, S a substitution, D a deletion, and = the identity of two words.
Errors detected are noted as reference sequence → candidate sequence.
7 Discussion
In this paper, we have described a translation memory based approach for the
recreation of the MÉTÉO system nearly 30 years after the birth of the first
prototype at the same university. The main difference between these two systems
is the way they were developed. The original system is a carefully handcrafted
one based on a detailed linguistic analysis, whilst ours simply exploits a memory
of previous translations of weather forecasts that were, of course, not available
at the time the original system was designed. Computational resources needed
for implementing this corpus-based approach are also much bigger than what
was even imaginable when the first MÉTÉO system was developed.
This paper shows that a simple-minded translation memory system can produce translations that are comparable in quality (although not as good) to the ones produced by the current system. Clearly, this prototype can be improved in
many ways. We have already shown that many errors could be handled by small
specific bilingual lexicons (place names, cardinals, etc.). Our translation memory
implementation is fairly crude compared to the current practice in example-based
machine translation [2], leaving a lot of room for improvements. We have also
started to investigate how this memory-based approach can be coupled with a
statistical machine translation system in order to further improve the quality of
the translations [12].
Acknowledgements. We thank Rick Jones and Marc Besner from the Dorval
office of Environment Canada who provided us with the corpus of bilingual
weather reports. We are also indebted to Elliot Macklovitch who provided us with
articles describing the MÉTÉO system that we could not have found otherwise.
This work has been funded by NSERC and FQRNT.
References
1. Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L.
Mercer. 1993. The mathematics of statistical machine translation: Parameter
estimation. Computational Linguistics, 19(2):263–311.
2. Michael Carl and Andy Way, editors. 2003. Recent Advances in Example-Based
Machine Translation, volume 21 of Text, Speech and Language Technology. Kluwer
Academic.
3. John Chandioux. 1988. Meteo (tm), an operational translation system. In RIAO.
4. J. Coch. 1998. Interactive generation and knowledge administration in multimeteo.
In Ninth International Workshop on Natural Language Generation, pages 300–303,
Niagara-on-the-lake, Ontario, Canada.
5. G. Foster, S. Gandrabur, P. Langlais, E. Macklovitch, P. Plamondon, G. Russell,
and M. Simard. 2003. Statistical machine translation: Rapid development with
limited resources. In Machine Translation Summit IX, New Orleans, USA, sep.
6. E. Goldberg, N. Driedger, and R. Kittredge. 1994. Using natural language pro-
cessing to produce weather forecasts. IEEE Expert 9, 2:45–53, apr.
7. Annette Grimaila and John Chandioux, 1992. Made to measure solutions, chap-
ter 3, pages 33–45. J. Newton, ed., Computers in Translation: A Practical Ap-
praisal, Routledge, London.
8. W. John Hutchins and Harold L. Somers, 1992. An introduction to Machine Trans-
lation, chapter 12, pages 207–220. Academic Press.
9. R. Kittredge, A. Polguère, and E. Goldberg. 1986. Synthesizing weather reports
from formatted data. In 11th. International Conference on Computational Lin-
guistics, pages 563–565, Bonn, Germany.
10. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-
based translation. In Proceedings of the Second Conference on Human Language
Technology Research (HLT), pages 127–133, Edmonton, Alberta, Canada, May.
11. Philippe Langlais, Michel Simard, and Jean Véronis. 1998. Methods and practical
issues in evaluating alignment techniques. In Proceedings of the 36th Annual Meet-
ing of the Association for Computational Linguistics (ACL), Montréal, Canada,
August.
12. Thomas Leplus, Philippe Langlais, and Guy Lapalme. 2004. A corpus-based Ap-
proach to Weather Report Translation. Technical Report, University of Montréal,
Canada, May.
13. Elliott Macklovitch. Personal communication of results of a linguistic performance
evaluation of METEO 2 conducted in 1985.
14. S. Nießen, S. Vogel, H. Ney, and C. Tillmann. 1998. A DP-based search algorithm
for statistical machine translation. In Proceedings of the 36th Annual Meeting of
the Association for Computational Linguistics (ACL) and 17th International Con-
ference on Computational Linguistics (COLING) 1998, pages 960–966, Montréal,
Canada, August.
15. Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Sys-
tems. Cambridge University Press. 270 p.
Keyword Translation from English to Chinese for
Multilingual QA
1 Introduction
Query translation plays an important role in Cross-Language Information Retrieval
(CLIR) and Multilingual Information Retrieval (MLIR) applications. Any application
that involves CLIR or MLIR requires translation; either the translation of the query
(usually a set of keywords) used to retrieve the documents or the translation of the docu-
ments themselves. Since translating documents is more expensive computationally, most
CLIR and MLIR systems choose query translation over document translation. Similarly,
the translation of keywords (words used to retrieve relevant documents and extract an-
swers) in a multilingual question-answering system is crucial when the question is in
one language and the answer lies within a document written in another language.
documents, and then to the Answer Generator module, which generates an answer to the question according to the information extracted. The keywords given by the Question
Analysis module are used both in retrieving documents and extracting information from
documents. Therefore, in multilingual question-answering, the Keyword Translator, as a
sub-module within the Question Analysis module, translates keywords produced by the
module into other languages so that the Retrieval Strategist module and the Information
Extractor module may use these keywords to find answers in these languages.
Typically, the quality of query translation in CLIR and MLIR is measured by the
performance of the CLIR or MLIR system as a whole (precision and recall). However,
rather than looking at the overall information retrieval or question-answering perfor-
mance, in this paper we will focus on the translation correctness of keywords for the
following reasons:
1. We want to isolate the translation problem from problems such as query expansion,
which can be dealt with separately in the Retrieval Strategist module.
2. We want to isolate the performance of the Keyword Translator so that each module
within the question-answering system can be evaluated individually.
3. In CLIR or MLIR systems, many documents are often retrieved for review, so it may not be as important that all parts of the query be correct as that the correct parts are weighted more heavily. In a question-answering system, by contrast, only one or a few answers are produced for review, so it is crucial that each keyword is translated correctly, in the correct sense according to the context.
1. Dictionary-based approaches often require special resources that are either difficult
to obtain or domain-specific. On the other hand, general-purpose MT systems are
more available and accessible, and many can be accessed through the web for free.
2. Many general-purpose MT systems can translate from one language to many other lan-
guages, whereas dictionary-based approaches require a bilingual dictionary for each
language pair and can differ greatly in quality and coverage.
3. General-purpose MT provides better word and phrase coverage than general dictio-
naries.
The Keyword Translator is an open-domain, MT-based word (or phrase) translator for
a question-answering system. We choose the MT-based approach for reasons noted in
the previous section. The Keyword Translator has two distinguishing features: 1) it
uses multiple MT systems and tries to select one correct translation candidate for each
keyword and 2) it utilizes the question sentence available to a question-answering system
as a context in which to translate the word in the correct sense.
We choose to use multiple MT systems and to utilize the question sentence based on
the following assumptions:
Keyword Translation from English to Chinese for Multilingual QA 167
Using more than one MT system gives us a wider range of keyword translation
candidates to choose from, and the correct translation is more likely to appear in
multiple MT systems than a single MT system, and
Using the question sentence available to a Question-Answering system gives us a
context in which to better select the correct translation candidate.
Given the above assumptions, the Keyword Translator follows these steps:
1. Get the question sentence and the keywords from the Question Analysis module.
2. Get keyword translation candidates for each keyword from multiple MT systems.
3. Use combinations of algorithms to score and/or penalize translation candidates from
each MT system.
4. Return a best-scoring translation for each keyword.
5. Send translated keywords back to the Question Analysis module.
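As a sketch of this pipeline (the function names, the scoring interface, and the tie-breaking by MT system priority are illustrative assumptions, not the actual implementation):

def translate_keywords(question, keywords, mt_systems, score):
    # mt_systems: list of functions mapping an English string to a Chinese string,
    # ordered by individual quality (S1 first). score ranks a candidate, e.g. using
    # the metrics of Section 4; here it takes the candidate, the question sentence,
    # and the number of MT systems that produced the candidate.
    translations = {}
    for keyword in keywords:
        candidates = {}                         # Step 2: candidates from every MT system
        for rank, mt in enumerate(mt_systems):
            candidate = mt(keyword)
            support, best_rank = candidates.get(candidate, (0, rank))
            candidates[candidate] = (support + 1, min(best_rank, rank))
        scored = []                             # Step 3: score each candidate
        for candidate, (support, best_rank) in candidates.items():
            scored.append((-score(candidate, question, support), best_rank, candidate))
        scored.sort()                           # Step 4: best score, ties broken by S1 > S2 > S3
        translations[keyword] = scored[0][2]
    return translations                         # Step 5: returned to the Question Analysis module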
Our focus in this paper is on step 3, where the translation candidates are scored so that
a correct translation can be selected in step 4. Based on our assumption, we conducted
an experiment to study ways to score translation candidates that would result in correct
keyword translations.
3 The Experiment
In this experiment we use three free web-based MT systems. We choose to use free
web-based MT systems because 1) They are easily accessible, and 2) They are free. We
study the performance (translation correctness) of the Keyword Translator using one,
two, and three MT systems with different keyword scoring algorithms, from English to
Chinese.
The scoring algorithm of the baseline model is simply to score keywords which
appear in two or more MT systems higher than those that appear in only one MT system.
In the case where all three have the same score (in the baseline model, this happens
when all three MT systems give different translations), the translation given by the best-
performing individual system, S1, would be selected. And in the case where S2 and S3
tie for the highest scoring keyword, S2 is chosen over S3. This is so that the accuracy
of multiple MT’s would be at least as good as the best individual MT when there is a
tie in the scores. Table 2 shows the MT systems we choose for 1, 2, and 3-MT system
models for comparing the improvement of multiple MT systems over individual MT
systems. Note that we choose the MT systems used for evaluating one-MT and two-MT performance based on Table 1, using the best MT system first. This way the improvement made by adding more MT systems is not inflated by adding systems with higher performance.
In the above example, both selected translations appear in two MT systems, so they were chosen over the remaining candidates. However, with the baseline model, using more MT systems may not give us an improvement over the 83.2% accuracy of the best individual system. This is due to the following reasons: 1) for many keywords the MT systems all disagree with one another, so S1 is chosen by default, and 2) the MT systems may agree on the wrong translation. The results from the baseline model show that there is much room for improvement in the scoring algorithm.
We segment the translated Chinese question sentence into words (word boundaries are not marked by spaces) using a combination of forward-most matching and backward-most matching. Forward-most matching (FMM) is a greedy string matching algorithm that scans from left to right and tries to find the longest substring of the string in a word list (we use a word list of 128,455 Mandarin Chinese words compiled from different resources). Backward-most matching (BMM) does the same thing from right to left. We keep the words segmented by FMM and BMM in an array that we call “the segmented sentence.” See Figure 3.
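A minimal sketch of the two segmentation passes (treating the word list as a Python set, and falling back to single characters when no listed word matches) is:

def forward_most_matching(sentence, word_list, max_len=8):
    # Greedy left-to-right pass: take the longest substring found in the word list,
    # falling back to a single character when nothing matches.
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in word_list:
                words.append(piece)
                i += length
                break
    return words

def backward_most_matching(sentence, word_list, max_len=8):
    # The same idea, scanning from right to left.
    words, j = [], len(sentence)
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            piece = sentence[j - length:j]
            if length == 1 or piece in word_list:
                words.insert(0, piece)
                j -= length
                break
    return words

def segmented_sentence(sentence, word_list):
    # Keep the words produced by both passes, as described above.
    return set(forward_most_matching(sentence, word_list)) | set(backward_most_matching(sentence, word_list))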
After we have the segmented Chinese sentence, we try to match the translated key-
words to the segmented sentence, and keywords that match the segmented sentence are
scored higher, since they are more likely to be translated in the context of the question
sentence.
A feature of the Chinese language is that words sharing characters are often semantically related. Using this idea, a translated keyword is considered to “partially match” the segmented sentence if the keyword has characters in common with any word in the segmented sentence (i.e. there exists a string that is both a substring of the keyword and a substring of some word in the segmented sentence). A fully matched word scores higher than a partially matched word, and a partially matched word scores higher than a word that does not match at all. As mentioned previously, if keyword translations have the same score, then by default S1 is selected over S2 or S3, and S2 is selected over S3. When a keyword translation partially matches a word in the segmented sentence, it is that word in the segmented sentence that is used as the keyword. Figure 4 shows examples of fully matched and partially matched words.
In order to solve the problem with the limited coverage of the word list, we also tried
word-matching on the entire un-segmented sentence. This is a simple string matching
to see if the translated keyword is a substring of the translated question sentence. There
are two variations to this metric: in the case where word-matching on the entire un-segmented question sentence fails, we can either fall back to partial word-matching on the segmented sentence or not fall back at all. Figure 6 shows full sentence word-matching with fall back to partial word-matching, and Figure 7 shows full sentence word-matching without fall back to partial word-matching.
In both Figures 6 and 7, two of the translation candidates match the question sentence while one does not. In Figure 6, partial matching maps the unmatched candidate to a word in the segmented sentence, which is therefore taken as the partially matched translation and scored as a partial match. In Figure 7, the unmatched candidate does not go through partial matching, so no score is added to that keyword.
Fig. 6. An example of full sentence word-matching with fall back to partial word-matching.
Fig. 7. An example of full sentence word-matching without fall back to partial word-matching.
This is done by simply checking whether characters matching [A-Z][a-z] appear in the keyword string. In Figure 8, the score of the translation candidate that still contains the untranslated word “Theresa” is penalized because it is not a full translation, so the fully translated candidate is selected instead.
4.4 Scoring
Each keyword starts with an initial score of 0.0, and as different metrics are applied,
numbers are added to or subtracted from the score. Table 3 shows actual numbers used for
this experiment; the same scoring scheme is used by all scoring metrics when applicable.
The general strategy behind the scoring scheme is as follows: keywords with full
match (A) receive the highest score, keywords with partial match (B) receive the second
highest score, keywords supported by more than one MT system (C) receive the lowest
score, and keywords not fully translated (D) receive a penalty to their score. All of
the above can be applied to the same keyword except for A and B, since a keyword
cannot be both a full match and a partial match at the same time. Support by more than one MT system receives the lowest score because, in our experiment, it proved
to be the least reliable indication of a correct translation. Full match proved to be the best indicator of a correct translation; therefore it receives the highest score, and it is
higher than the combination of partial match and support by more than one MT system.
Keywords not fully translated should be penalized heavily, since they are generally
considered incorrect; therefore the penalty is set equal to the highest score possible,
the combination of A and C. This way, another translation that is a full translation has
the opportunity to receive a higher score. With the above general strategy in mind, the
numbers were manually tuned to give the best result on the training set.
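The scoring scheme can be sketched as follows; the weights below are hypothetical stand-ins for the tuned values in Table 3, chosen only to respect the ordering constraints just described (A highest, A greater than B plus C, and the penalty D equal in magnitude to the largest attainable score).

import re

# Hypothetical weights standing in for the manually tuned values of Table 3.
FULL_MATCH = 3.0          # A: candidate found verbatim in the translated question sentence
PARTIAL_MATCH = 1.0       # B: candidate shares characters with a word of the segmented sentence
MULTI_MT_SUPPORT = 0.5    # C: candidate produced by more than one MT system
NOT_TRANSLATED = -(FULL_MATCH + MULTI_MT_SUPPORT)   # D: penalty for partially or un-translated keywords

def shares_characters(candidate, word):
    # Any character in common counts as a partial match in this sketch.
    return any(ch in word for ch in candidate)

def score_candidate(candidate, question_translation, segmented_words, support_count):
    score = 0.0
    if candidate in question_translation:
        score += FULL_MATCH
    elif any(shares_characters(candidate, w) for w in segmented_words):
        score += PARTIAL_MATCH
    if support_count > 1:
        score += MULTI_MT_SUPPORT
    if re.search('[A-Za-z]', candidate):     # untranslated English letters remain in the candidate
        score += NOT_TRANSLATED
    return score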
5 Results
We construct 7 different models (including the baseline model) by combining various scoring metrics. In all models we use the baseline metric, which adds to the score of keyword translation candidates that are supported by more than one MT system. Table 4 shows the abbreviation for each metric, the description of each metric, the section of this paper that describes each metric, and the scoring (refer to the Table 3 column headings to look up the scoring) that applies to each metric:
Table 5 shows the percentage of keywords translated correctly using different models on the training set, which consists of 125 keywords from 50 questions. Table 6 shows the improvement of the different models over the baseline model, based on Table 5:
Table 7 shows the percentage of keywords translated correctly using different models
on the test set, which consists of 147 keywords from 50 questions. Table 8 shows the
improvement over the baseline:
Note that models which use only one MT system do not improve over the baseline model, because no improvement can be made when there are no alternative translations to choose from. However, single-MT models can degrade due to partial word-matching using the segmented sentence. We will discuss problems with using the segmented sentence and other issues in the next section.
6 Discussions
From the results of different models on the training set and test set, we make the following
observations:
1. In almost all models, increasing the number of MT systems used either increases the performance of the model or leaves it unchanged; in other words, additional MT systems do not seem to degrade translation correctness but have the potential to improve it.
2. As shown in model B+F1, using word-matching on the translated question sentence for sense disambiguation does improve translation correctness.
3. From the results of models with S and F2 we see that scoring metrics requiring word list segmentation not only do not improve the translation, they can degrade it below the baseline model. Upon inspection we see that this method, though intuitively sound, relies on the word list to do the segmentation, and the word list's limited coverage degrades the translation greatly.
4. Full sentence word-matching (F1) with penalty for partially or un-translated key-
words (P) yields the best results. Although P does not improve the baseline by itself,
it boosts the performance of F1 greatly when combined.
From the above four points and other observations, we briefly describe the pros and
cons of using the different scoring metrics in Table 9. The asterisk indicates that this
experiment does not validate the statement due to the limited coverage of the word list
we used.
From Table 9 we can see why model B+F1+P with all three MT systems outperforms the others. It 1) uses three MT systems, 2) penalizes keywords that are not fully translated,
and 3) does word sense disambiguation without relying on segmentation which needs a
word list with adequate coverage, and such a word list may be difficult to obtain. Thus for
translating keywords using general MT systems, we can suggest that 1) it is better to use
more MT systems if they are available, 2) always penalize un-translated words because
different MT systems have different word coverage, and 3) in a setting where resources
are limited (small word lists), it is better not to use methods involving segmentation.
7 Conclusion
In this paper, we first present the general problem of keyword translation in a multilingual
open-domain question-answering system. Then based on this general problem, we chose
an MT-based approach using multiple free web-based MT systems. And based on our
assumption that using multiple MT systems and the question sentence can improve
translation correctness, we present several scoring metrics that can be used to build
models that choose among keyword translation candidates. Using these models in an
experiment, we show that using multiple MT systems and using the question sentence to
do sense disambiguation can improve the correctness of keyword translation.
Extraction of Name and Transliteration in Monolingual
and Parallel Corpora
1 Introduction
Multilingual named entity identification and (back) transliteration are important for
machine translation (MT), question answering (QA), and cross-language information retrieval (CLIR). These transliterated names are not usually found in existing
bilingual dictionaries. Thus, it is difficult to handle transliteration only via simple
dictionary lookup. The effectiveness of CLIR hinges on the accuracy of transliteration
since proper names are important in information retrieval.
Handling transliterations of proper names is not as trivial as one might think.
Transliterations tend to vary from translator to translator, especially for names of less
known persons and unfamiliar places. This is exacerbated by the different Romanization systems used for Asian names written in Mandarin Chinese or Japanese. Back transliteration involves converting a transliteration back to the unique original name, so there is one and only one solution for most instances of the back transliteration task.
Therefore, back transliteration is considered more difficult than transliteration. Knight
and Graehl [4] pioneered the study of automatic transliteration by the computer and
proposed a statistical transliteration model to experiment on converting Japanese
transliterations back to English. Following Knight and Graehl, most previous work on
machine transliteration [1, 2, 6, 8, 9, 10, 11] focused on the tasks of machine
transliteration and back-transliteration. Very little has been touched on the issue of
extracting names and their transliterations from corpora [5, 7].
The alternative to phoneme-by-phoneme machine (back) transliteration is simple
table lookup in transliteration memory automatically acquired from corpora. Most
instances of names and transliteration counterparts can often be found in large
monolingual or parallel corpora that are relevant to the task. In this paper, we propose
a new method for extracting names and transliterations based on a statistical model
trained automatically on a bilingual proper name list via unsupervised learning. We
also carried out experiments and evaluation of training and applying the proposed
model to extract names and translations from two parallel corpora and a Mandarin
translation corpus. The remainder of the paper is organized as follows. Section 2 lays
out the model and describes how to apply the model to align word and transliteration.
Section 3 describes how the model is trained on a set of proper names and
transliterations. Section 4 describes experiments and evaluation. Section 5 contains
discussion and we conclude in Section 6.
We will first illustrate our approach with examples. Consider transliterating the
English word “Stanford” into Mandarin Chinese. Although there are conceivably
many acceptable transliterations, the most common one is the one romanized as “shi-dan-fo.” We assume that transliteration is done piecemeal, converting one to six
letters as a transliteration unit (TU) to a Mandarin character (transliteration character,
TC). For instance, in order to transliterate “Stanford,” we break it up into four TU’s:
“s,” “tan,” “for,” and “d.” We assume that each TU is converted to zero to two
Mandarin characters independently. In this case, the “s” is converted to “tan” to
“for” to and “d” to the empty string In other words, we model the
transliteration process based on independence of transliteration of TU’s. Therefore,
we have the transliteration probability of getting the transliteration given
“Stanford,” Stanford),
where a is the alignment between English transliteration units (mostly syllables) and
Mandarin Chinese characters
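To make the decomposition concrete, here is a minimal sketch of the TU-based scoring, assuming a toy, hand-invented P(TC | TU) table; the romanized TC strings, probability values, and function name are illustrative, not values from the paper.

```python
# Minimal sketch of the TU-based transliteration probability.  The table below is a
# toy, hypothetical P(TC | TU) table; the real model estimates these values from data.
tu_tc_prob = {
    ("s", "shi"): 0.4,    # TU "s"   -> TC romanized "shi"
    ("tan", "dan"): 0.5,  # TU "tan" -> TC romanized "dan"
    ("for", "fo"): 0.3,   # TU "for" -> TC romanized "fo"
    ("d", ""): 0.6,       # TU "d"   -> empty string (no TC)
}

def transliteration_prob(alignment):
    """Product of independent TU->TC probabilities for one alignment."""
    prob = 1.0
    for tu, tc in alignment:
        prob *= tu_tc_prob.get((tu, tc), 1e-6)  # small floor for unseen pairs
    return prob

# One candidate alignment for "Stanford" -> "shi-dan-fo"
alignment = [("s", "shi"), ("tan", "dan"), ("for", "fo"), ("d", "")]
print(transliteration_prob(alignment))  # 0.4 * 0.5 * 0.3 * 0.6 ≈ 0.036
```

In the trained model these probabilities are estimated automatically from a list of name-transliteration pairs, as described in Section 3.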
It might appear that using Chinese phonemes instead of characters would cope better
with data sparseness. However, there is a strong tendency to use a limited set of
characters for transliteration purposes. By using characters directly, we can take
advantage of this lexical preference and obtain tighter estimates (see Figure 1).
There are several ways in which such a machine transliteration model (MTM) can be
applied, including (1) transliteration of proper names, (2) back transliteration to the
original proper name, and (3) name-transliteration extraction from a corpus. We focus
on the third problem, name-transliteration extraction, in this study.
Fig. 1. TU and TC mappings; low-probability mappings are removed for clarity.
Given a list of proper names and transliterations, it is possible to break up the names
and transliterations into pairs of matching TUs and TCs. With these matching TUs
and TCs, it is then possible to estimate the parameters of the machine transliteration
model. The process of applying the model to decompose names and their transliterations
and of training the model can be carried out in a self-organized fashion using the
Expectation Maximization (EM) algorithm [3].
2.2 Training
Expectation Step: In the expectation step, the best way to describe how a proper
name gets transliterated is revealed by decomposing it into TUs, which amounts to
calculating the maximum probability P(T, W) in Equation (3). This also amounts to
finding the best Viterbi path matching up the TUs in W and the TCs in T, and it can
be done efficiently with Equation (3) by defining and calculating a forward probability
via dynamic programming, which denotes the probability of aligning the first i-1 Chinese
characters1 and the first j-1 English letters.
1 Cases involving two Mandarin characters are rare; the “x” in “Marx,” for example, is
transliterated into two TCs.
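The alignment step can be pictured with the dynamic-programming sketch below. It is not the authors' implementation: the TU length limit of six letters follows the earlier description, each TU is mapped to at most one TC (the rare two-character case from the footnote is ignored for brevity), and the probability table is the same toy table as in the previous sketch.

```python
# Sketch of Viterbi-style alignment of an English name with a sequence of TCs.
# Assumptions: TUs are 1-6 letters, each TU maps to at most one TC (or to none),
# and tu_tc_prob is a hypothetical probability table as in the earlier sketch.
def best_alignment(word, tcs, tu_tc_prob, max_tu_len=6):
    n, m = len(word), len(tcs)
    # best[j][i] = (probability, alignment) covering word[:j] and tcs[:i]
    best = [[(0.0, []) for _ in range(m + 1)] for _ in range(n + 1)]
    best[0][0] = (1.0, [])
    for j in range(n + 1):
        for i in range(m + 1):
            prob, ali = best[j][i]
            if prob == 0.0:
                continue
            for k in range(1, max_tu_len + 1):        # length of the next TU
                if j + k > n:
                    break
                tu = word[j:j + k]
                for step in (0, 1):                   # TU maps to 0 or 1 TC
                    if i + step > m:
                        continue
                    tc = tcs[i] if step == 1 else ""
                    p = prob * tu_tc_prob.get((tu, tc), 1e-6)
                    if p > best[j + k][i + step][0]:
                        best[j + k][i + step] = (p, ali + [(tu, tc)])
    return best[n][m]

tu_tc_prob = {("s", "shi"): 0.4, ("tan", "dan"): 0.5, ("for", "fo"): 0.3, ("d", ""): 0.6}
print(best_alignment("stanford", ["shi", "dan", "fo"], tu_tc_prob))
```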
Maximization Step: With all the TU-TC pairs obtained for the list of names and
transliterations in the Expectation Step, we update the maximum likelihood
estimates (MLE) of the model parameters using Equation (5).
The EM algorithm iterates between the Expectation and Maximization Steps until the
likelihood converges. The maximum likelihood estimate alone is generally not suitable,
since it does not capture the fact that there are other transliteration possibilities that
we may not have encountered. Based on our observations, we therefore use a linear
interpolation (LI) of the MLE and the Romanization-based estimates of Equation (2)
to approximate the parameters of the machine transliteration model.
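A schematic version of the training loop, reusing best_alignment() from the sketch above, might look as follows; the fallback table standing in for the Romanization-based estimates of Equation (2) and the interpolation weight lam are illustrative assumptions, not the settings used in the paper.

```python
from collections import defaultdict

# Sketch of the EM-style training loop with linear-interpolation smoothing.
# Reuses best_alignment() from the alignment sketch above.  `romanization_prob`
# stands in for the Romanization-based estimate of Equation (2), and `lam` is an
# illustrative interpolation weight; neither reflects the paper's actual settings.
def train(pairs, tu_tc_prob, romanization_prob, lam=0.7, iterations=5):
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        # Expectation: decompose each (name, transliteration) pair with the
        # current parameters and collect TU-TC counts along the best alignment.
        for word, tcs in pairs:
            _, alignment = best_alignment(word, tcs, tu_tc_prob)
            for tu, tc in alignment:
                counts[(tu, tc)] += 1.0
                totals[tu] += 1.0
        # Maximization: maximum likelihood re-estimate, then linear interpolation
        # with the fallback estimate so unseen pairs keep some probability mass.
        mle = {(tu, tc): c / totals[tu] for (tu, tc), c in counts.items()}
        keys = set(mle) | set(romanization_prob)
        tu_tc_prob = {k: lam * mle.get(k, 0.0) + (1 - lam) * romanization_prob.get(k, 0.0)
                      for k in keys}
    return tu_tc_prob
```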
There are two ways in which the machine transliteration model can be applied to
extract proper names and their transliterations. In this section, we describe the two
cases of transliteration extraction: extracting proper names and their transliterations
from a parallel corpus and from a monolingual corpus.
(5) “When you understand all about the sun and all about the atmosphere and all
about the rotation of the earth, you may still miss the radiance of the sunset.” So
wrote English philosopher Alfred North Whitehead.
It is not difficult to build a part-of-speech tagger and a named entity recognizer for
finding the proper nouns “Alfred,” “North,” and “Whitehead.” We then use the same
Viterbi decoding process described in Section 2 to identify the transliteration in the
target-language sentence S. All substrings of S are considered as transliteration
candidates.
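As a rough sketch of this step (names, romanized characters, and scores are invented; the real system uses the trained model), one can score every substring of S with the alignment routine from the earlier sketch and keep the best-scoring span.

```python
# Sketch: choose the best-scoring substring of the target sentence S as the
# transliteration of a given source-language proper noun.  Reuses best_alignment()
# and a (hypothetical) trained tu_tc_prob table from the earlier sketches.
def extract_transliteration(name, target_chars, tu_tc_prob):
    best_score, best_span = 0.0, None
    for i in range(len(target_chars)):
        for j in range(i + 1, len(target_chars) + 1):
            score, _ = best_alignment(name.lower(), target_chars[i:j], tu_tc_prob)
            if score > best_score:
                best_score, best_span = score, target_chars[i:j]
    return best_span, best_score
```

In practice the candidate substrings would be bounded in length relative to the source name, which keeps the quadratic search over spans cheap.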
The model also performs reasonably well in the opposite direction, Chinese-to-English
transliteration. This indicates that the model, together with the parameter estimation
method, generalizes well to unseen events and works in both directions.
There are several kinds of errors in our results. Inevitably, some errors are due to
data sparseness, leading to erroneous identification. Most errors, however, involve
translations not covered by the machine transliteration model, such as the Chinese
rendering of “San Francisco,” which is a translation rather than a transliteration.
Some are part translation and part transliteration, such as the mapping for “North
Easton.” We found that “east,” “west,” “south,” “north,” “long,” “big,” “new,” “nova,”
and “St.” tend not to be transliterated, leading to errors when the machine transliteration
model is applied to extract name-translation pairs. These errors can easily be fixed by
adding these items as TUs with literal translations to the table of transliteration
probabilities. We have restricted our discussion and experiments to the transliteration
of proper names; transliterations of Chinese common nouns into lowercase English
words are not considered.
5 Conclusion
In this paper, we have proposed a new statistical machine transliteration model and
described how to apply it to extract words and transliterations from monolingual and
parallel corpora. The model was first trained on a modest list of names and
transliterations. The training resulted in a set of ‘syllable’-to-character transliteration
probabilities, which were subsequently used to extract proper names and
transliterations from a corpus. These named entities are crucial for the development of
named entity identification modules in CLIR and QA.
We carried out experiments with an implementation of the word-transliteration
alignment algorithm and tested it on three sets of test data. The evaluation showed that
very high precision rates were achieved.
A number of interesting future directions present themselves. First, it would be
interesting to see how effectively we can port and apply the method to other language
pairs such as English-Japanese and English-Korean. We are also investigating the
advantages of incorporating a machine transliteration module in sentence and word
alignment of parallel corpora.
References
1. Al-Onaizan, Y. and K. Knight. 2002. Translating named entities using monolingual and
bilingual resources. In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), pages 400-408.
2. Chen, H.H., S-J Huang, Y-W Ding, and S-C Tsai. 1998. Proper name translation in cross-
language information retrieval. In Proceedings of 17th COLING and 36th ACL, pages 232-
236.
3. Dempster, A.P., N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38.
4. Knight, K. and J. Graehl. 1998. Machine transliteration. Computational Linguistics,
24(4):599-612.
5. Lee, C.J. and Jason S. Chang. 2003. “Acquisition of English-Chinese Transliterated Word
Pairs from Parallel-Aligned Texts using a Statistical Machine Transliteration Model,” In
Proceedings of HLT-NAACL 2003 Workshop, pp. 96-103.
6. Lee, J.S. and K-S Choi. 1997. A statistical method to generate various foreign word
transliterations in multilingual information retrieval system. In Proceedings of the 2nd
International Workshop on Information Retrieval with Asian Languages (IRAL’97), pages
123-128, Tsukuba, Japan.
7. Lin, T., J.-C. Wu, and J.S. Chang. 2003. Word Transliteration Alignment. In Proceedings of
the Fifteenth Research on Computational Linguistics Conference (ROCLING XV), Hsinchu.
8. Lin, W-H and H-H Chen. 2002. Backward transliteration by learning phonetic
similarity. In CoNLL-2002, Sixth Conference on Natural Language Learning, Taiwan.
9. Oh, J-H and K-S Choi. 2002. An English-Korean transliteration model using pronunciation
and contextual rules. In Proceedings of the 19th International Conference on
Computational Linguistics (COLING), Taiwan.
10. Stalls, B.G. and K. Knight. 1998. Translating names and technical terms in Arabic text. In
Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic
Languages.
11. Tsujii, K. 2002. Automatic extraction of translational Japanese-KATAKANA and English
word pairs from bilingual corpora. International Journal of Computer Processing of
Oriental Languages, 15(3):261-279.
Error Analysis of Two Types of Grammar
for the Purpose of Automatic Rule Refinement*
2 MT Approach
3 Learning Rules
For the experiment described in this paper, the translation direction is English
to Spanish, and the latter is used for illustration purposes to simulate a resource-
poor language. In the following discussion of the learning procedure, the x-side
always refers to English, whereas the y-side refers to Spanish.
Fig. 1. Sample Translation Rule. x means source (here: English) and y, target (here:
Spanish).
The first step in the Rule Learning (RL) module is Seed Generation. For
each training example, the algorithm constructs at least one ‘seed rule’, i.e. a
flat rule that incorporates all the information known about this training example,
producing a first approximation to a valid transfer rule. The transfer rule parts
can be extracted from the available information or are projected from the English
side.
After producing the seed rule, compositionality is added to it. Composition-
ality aims at learning rules that can combine to cover more unseen examples.
The algorithm makes use of previously learned rules. For a given training exam-
ple, the algorithm traverses through the parse of the English sentence or phrase.
For each node, the system checks if there exists a lower-level rule that can be
used to correctly translate the words in the subtree. If this is the case, then an
element of compositionality is introduced.
In order to ensure that the compositionality algorithm has already learned
lower-level rules when it tries to learn compositional rules that use them, we
learn rules for simpler constituents first. Currently, the following order is ap-
plied: adverb phrases, adjective phrases, noun phrases, prepositional phrases,
and sentences. While this improves the availability of rules for the composi-
tionality module at the right time, the issue of co-embedding still needs to be
resolved. For more details, see [6].
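The compositionality pass can be caricatured as below; the toy parse structure and the simplified "is this subtree covered by a learned rule?" test are invented for illustration and stand in for the full mechanism described in [6].

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Toy parse node and rule store, invented for illustration only.
@dataclass
class Node:
    label: str
    words: List[str]
    children: List["Node"] = field(default_factory=list)

def add_compositionality(node: Node, learned: Dict[str, str], components: List[str]):
    """Collect, for each child constituent, either a reference to an already
    learned lower-level rule or its literal words (a flat seed-rule element)."""
    for child in node.children:
        if child.label in learned:          # a lower-level rule covers this subtree
            components.append(child.label)  # introduce an element of compositionality
        else:
            components.append(" ".join(child.words))
    return components

# Usage: "a great friend" with an already-learned ADJP rule.
np = Node("NP", ["a", "great", "friend"],
          [Node("DET", ["a"]), Node("ADJP", ["great"]), Node("N", ["friend"])])
print(add_compositionality(np, {"ADJP": "ADJP,3"}, []))  # ['a', 'ADJP', 'friend']
```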
4 Refining Rules
Instead of having to pay a human translator to correct all the sentences output
by an MT system, we propose using bilingual speakers to obtain information
about translation errors and using this information to correct the problems at their
root, namely by refining the translation rules that generated the errors.
The refinement process starts with an interactive step, which elicits informa-
tion about the correctness of the translations from users with the Translation
Correction Tool (TCTool). This tool allows bilingual users to correct the output
of an MT system and to indicate the source of the error. Users can insert words,
change the order of words, or specify that an agreement was violated.
The delta function is computed for all the relevant words that differ between T1 and T2,
over word positions 1 through n, where n is the length of the sentence.
The resulting delta set is used as guidance to explore the feature space of
potentially relevant attributes until the RR module finds the ones that triggered
the correction and can add the appropriate constraints to the relevant rules.
Suppose we are given the English sentence Juan is a great friend, and the
translation grammar has the general rule for noun phrases NP::NP : [ADJ N] -> [N ADJ]
(see Figure 1 for the complete rule). The translation output from the MT system will
then be Juan es un amigo grande (T1), since that is what the general rule for nouns
and adjectives in Spanish dictates. However, this sentence instantiates an exception to
the general rule, and thus the user will most likely correct the sentence to Juan es un
gran amigo (T1').
Now, the RR module needs to find a minimal pair (T2) that illustrates the
linguistic phenomenon that accounts for the MT error. Let us assume that Juan
is a smart friend is also in the elicitation corpus and that it gets correctly
translated as Juan es un amigo inteligente.
The delta function is then evaluated on these two TL sentences.
Since we are comparing minimal pairs, the delta function reduces to comparing
just those words that differ between the two TL sentences but are aligned to
the same SL word, or that are in the same position and have the same POS;
we do not need to calculate the delta function for differing words that are not
relevant, such as (un, amigo).
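A sketch of this minimal-pair comparison is given below; the word-level feature dictionaries and the positional alignment are simplifications invented here, not the system's actual representation.

```python
# Sketch of the delta function over a minimal pair of TL sentences.
# Each word carries a toy feature dictionary; only words aligned to the same
# SL word (same position here, for simplicity) and differing on the surface are compared.
def delta_set(t1, t2):
    deltas = []
    for (w1, f1), (w2, f2) in zip(t1, t2):
        if w1 != w2:  # a relevant difference between the two TL sentences
            differing = {a for a in set(f1) | set(f2) if f1.get(a) != f2.get(a)}
            deltas.append((w1, w2, differing))
    return deltas

t1 = [("Juan", {}), ("es", {}), ("un", {}), ("amigo", {"num": "sg"}), ("grande", {"num": "sg"})]
t2 = [("Juan", {}), ("es", {}), ("un", {}), ("amigo", {"num": "sg"}), ("inteligente", {"num": "sg"})]
print(delta_set(t1, t2))  # [('grande', 'inteligente', set())] -> no distinguishing feature
```

An empty feature set for a differing word pair corresponds to the situation described next, where the existing features cannot explain the correction.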
If the delta set had contained a feature attribute such that, when it is changed in
T2, users correct T2 in the same way they corrected T1, then we would
hypothesize that this attribute accounts for the correction and would bifurcate the NP
rule based on it.
Since both adjectives are singular and unspecified with respect to gender, the
delta set is empty, and thus the RR module determines that the existing feature
set is insufficient to explain the difference between prenominal and postnominal
adjectives; it therefore postulates a new binary feature, feat_1.
Once the RR module has determined the triggering features, it proceeds to refine
the relevant grammar and lexical rules to include prenominal adjectives by adding
the appropriate feature constraints. In this case, the RR module creates a duplicate
of the NP rule shown in Figure 1, switches N ADJ to ADJ N on the target side, and
adds the constraint (feat_1 = +) to it. The original rule also needs to be refined to
include the same constraint with the other feature value, (feat_1 = -). The lexicon
will later need to encode the new feature as well.
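The bifurcation can be pictured as follows; the dictionary-based rule encoding is a simplification invented for this sketch, not the grammar formalism actually used.

```python
import copy

# Sketch of bifurcating an NP rule on a newly postulated binary feature feat_1.
np_rule = {"name": "NP,1", "x_side": ["ADJ", "N"], "y_side": ["N", "ADJ"], "constraints": []}

def bifurcate(rule, feature):
    new_rule = copy.deepcopy(rule)
    new_rule["name"] = rule["name"] + "'"
    new_rule["y_side"] = ["ADJ", "N"]               # prenominal order on the target side
    new_rule["constraints"].append((feature, "+"))  # new rule applies when feat_1 = +
    rule["constraints"].append((feature, "-"))      # original restricted to feat_1 = -
    return rule, new_rule

print(bifurcate(np_rule, "feat_1"))
```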
More detailed examples of minimal pair comparison and the feature space
exploration algorithm can be found in [2].
5 English-Spanish Experiment
We ran an experiment with two different grammars: a manually written one and
one automatically learned with the RL module described in Section 3.
The training set contains the first 200 sentences of the AVENUE elicitation
corpus, and the test set contains 32 sentences drawn from the next 200 sentences
in the corpus (the same set that was used for the English-Spanish user study [3]).
Both MT systems used a lexicon with 442 entries, developed semi-automatically
to cover the training and test sets (400 sentences), so that the effectiveness of the
rules can be measured abstracting away from lexical gaps in the translation system.
Descriptions of the two types of grammar follow.
Overall, the enhanced grammar rules learned by the Rule Learner make useful
generalizations and are quite close to what a linguist might write; however, they
often contain constraints that are either too specific or too general. In the rules
above, the constraint on the tense of the verb in rule S,90 is too specific; ideally
we would like this rule to apply to all VPs regardless of their tense.
On the other hand, the Rule Learner can overgeneralize: when it finds that a
feature value match occurs often enough, it decides that there is a correlation,
and an agreement constraint gets added to the rule. This is illustrated by
the number agreement constraint in rule VP,46. Because enough examples in the
training data had the verb and the object coincide in number, the learner assumes
that a generalization can be made. In Spanish, however, verbs and their objects do
not always agree.
There are many more rules in the learned grammar because the current
implementation of the RL module does not discard rules that are too
specific and that are subsumed by rules which achieved a higher level of
generalization during the learning process.
When we ran our transfer system on the test set of 32 sentences, we observed
that the manually written grammar results in less ambiguity (on average
1.6 different translations per sentence) than the automatically learned grammar
(18.6 different translations per sentence). At the same time, the final version of
the learned grammar results in less ambiguity than the learned grammar with
no constraints. This is to be expected, since relevant constraints restrict the
application of general rules to the appropriate cases.
Additional ambiguity is not necessarily a problem; however, the goal of the
transfer engine should be to produce the most likely outputs first, i.e. with the
highest rank. In experiments reported elsewhere [4], we have used a statistical
decoder with a TL language model to disambiguate between different translation
hypotheses. We are currently investigating methods to prioritize rules and partial
analyses within the transfer engine, so that we can rank translation hypotheses
also when no TL language model is available.
While this work is under investigation, we emulated this module with a simple
reordering of the grammar: we reordered three rules (2 NPs and 1 S rule) that had
a high level of generalization (namely containing agreement constraints instead
of the more specific value constraints) to be at the beginning of the grammar.
This in effect gives higher priority to the translations produced with these rules.
Most of the translation errors produced by the manual grammar can be classified
as lack of subject-predicate agreement, wrong word order of object (clitic) pronouns,
wrong prepositions, wrong forms (case), and out-of-vocabulary words. On top of
the errors produced by the manual grammar, the current version of the learned
grammar also had errors of the following types: missing agreement constraints,
missing prepositions, and overgeneralization.
An example of the differences between the errors produced by the two types of
grammar can be seen in the translation of John and Mary fell. The manual
grammar translates it as Juan y Maria cayeron, whereas the learned gram-
mar translates it as Juan y Maria cai. The learned grammar does have an
NP rule that covers [Juan y Maria]; however, it lacks the number constraint
indicating that the number of an NP with this constituent sequence ([NP
CONJ NP]) has to be plural. The translation produced by the manual gram-
mar, Juan y Maria cayeron, is also ungrammatical, but the translation error
in this case is a bit more subtle and thus much harder to detect. The correct
translation is Juan y Maria se cayeron.
Another example that illustrates this is the translation of John held me with
his arm. The MT system with the manual grammar outputs sujetó me
con su brazo, whereas the one with the learned grammar outputs Juan
sujetó me con su brazo.
Sometimes the output from the learned grammar is actually better than the
manual grammar output. An example of this can be seen in the translations of
Mother, can you help us?. The manual grammar translates this as
puedo ayudar nos? and the learned grammar translates it as puedes
tu ayudar nos?. The number of corrections required to fix both translations is
the same, but the one produced by the learned grammar is clearly better.
References
1. Charniak, E.: A Maximum-Entropy-Inspired Parser, North American chapter of
the Association for Computational Linguistics (NAACL), 2000.
2. Font Llitjós, A.: Towards Interactive and Automatic Refinement of Translation
Rules, PhD Thesis Proposal, Carnegie Mellon University, forthcoming in August
2004 (www.cs.cmu.edu/~aria/ThesisProposal.pdf).
3. Font Llitjós, A., Carbonell, J.: The Translation Correction Tool: English-Spanish
user studies, 4th International Conference on Language Resources and Evaluation
(LREC), 2004.
4. Lavie, A., Vogel, S., Levin, L., Peterson, E., Probst, K., Font Llitjós, A., Reynolds,
R., Carbonell, J., Cohen, R.: Experiments with a Hindi-to-English Transfer-based
MT System under a Miserly Data Scenario, ACM Transactions on Asian Language
Information Processing (TALIP), 2:2, 2003.
5. Papineni, K., Roukos, S., Ward, T.: Maximum Likelihood and Discriminative
Training of Direct Translation Models, Proceedings of the International Confer-
ence on Acoustics, Speech, and Signal Processing (ICASSP-98), 1998.
6. Probst, K., Levin, L., Peterson, E., Lavie, A., Carbonell J.: MT for Resource-Poor
Languages Using Elicitation-Based Learning of Syntactic Transfer Rules, Machine
Translation, Special Issue on Embedded MT, 2003.
7. Probst, K., Brown, R., Carbonell, J., Lavie, A., Levin, L., Peterson, E.: Design and
Implementation of Controlled Elicitation for Machine Translation of Low-density
Languages, Workshop MT2010 at Machine Translation Summit VIII, 2001.
The Contribution of End-Users
to the TransType2 Project
Elliott Macklovitch1
RALI Laboratory, Université de Montréal
C.P. 6128, succursale Centre-ville
Montréal, Canada H3C 3J7
macklovi@iro.umontreal.ca
1 Introduction
The goal of the TransType2 project (Foster et al. 2002) is to develop a novel type of
interactive machine translation system. The system observes the user as s/he types a
translation, attempts to infer the target text the user has in mind and periodically
proposes extensions to the prefix which the user has already keyed in. The user is free
to accept these completions, modify them as desired or ignore them by simply
continuing to type. With each new character the user enters, the system revises its
predictions in order to make them compatible with the user’s input.
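As a caricature of this interaction, consider the loop below; the "engine" is a trivial translation-memory lookup, standing in for TransType's statistical translation and language models, and all names and data are invented.

```python
# Caricature of the TransType interaction loop: after every character the user
# types, the system proposes an extension of the current target-text prefix.
# The "engine" here is a toy prefix lookup, standing in for the real statistical models.
def toy_engine(source, prefix, memory):
    for target in memory.get(source, []):
        if target.startswith(prefix) and len(target) > len(prefix):
            return target[len(prefix):]          # proposed completion
    return ""

def simulate_session(source, user_target, memory):
    prefix = ""
    for ch in user_target:
        proposal = toy_engine(source, prefix, memory)
        if proposal and (prefix + proposal) == user_target:
            return prefix + proposal             # user accepts the whole completion
        prefix += ch                             # user keeps typing; system re-predicts
    return prefix

memory = {"The door opened.": ["La porte s'est ouverte.", "La porte a ouvert."]}
print(simulate_session("The door opened.", "La porte s'est ouverte.", memory))
```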
In itself, interactive machine translation (IMT) is certainly not novel; in fact, the
first attempts at IMT go back to the MIND system, which was developed by Martin
Kay and Ron Kaplan at the RAND Corporation in the late 1960’s.2 There have been
numerous subsequent attempts to implement IMT, some of which gave rise to
commercial systems, like ALPS’ ITS system, while others have been embedded in
controlled language systems, like the KANT system developed at CMU (Nyberg et al.
1997).3 What all of these previous efforts have in common is that the focus of the
interaction between the user and the system is on the source text. In particular,
whenever the system is unable to disambiguate a portion of the source text, it requests
assistance from the user. This can be to help resolve various types of source language
ambiguity, such as the correct morpho-syntactic category of a particular word,
syntactic dependencies between phrases, or the referent of an anaphor. In principle,
once the user has provided the system with the information necessary to disambiguate
the source text, the system can then complete its analysis and continue to properly
translate the text into the target language.
1 The work described in this article is the fruit of a sustained collaborative effort, and I want to
express my gratitude to all the participants in the TT2 Consortium, particularly to the
translators who are testing successive versions of the system.
2 For more on MIND, see (Hutchins 1986), pp. 296-297.
3 Other IMT systems specifically focus on multi-target translation; see for example (Blanchon
and Boitet 2004) and (Wehrli 1993).
This is not the place to enumerate all the difficulties that have dogged this classic
approach to interactive MT; however, there are a few important differences with
TransType which we should point out. Suppose that the user of the system is a
translator, as was often the case in the early decades of MT. Notice that the kind of
information being solicited from the user by these classic IMT systems does not focus
on translation knowledge per se, but instead involves formal linguistic analysis, of a
kind that many translators have not been trained to perform. In contrast, the focus of
the interaction between the user and the system in TransType is squarely on the
drafting of the target text. After reading the current source text segment, the translator
begins to type his/her desired translation. Based on its analysis of the same source
segment and using its statistical translation and language models, TransType
immediately proposes an extension to the characters the user has keyed in. The user
may accept all or part of the proposed completion, or s/he may simply go on typing;
in which case, the system continues trying to predict the target text the user has in
mind. When the system performs well, the user will normally accept these machine-
generated completions, thereby diminishing the number of characters s/he has to type
and hopefully reducing overall translation time. But the important point is that in this
paradigm both the user and the system contribute in turn to the drafting of the target
text, and the translator is not solicited for information in an area in which s/he is not
an expert.
Another important difference between classic IMT systems and the target-text
mediated approach of TransType may be formulated in this way: Who leads? In the
classic IMT approach, it is the system that has the initiative in the translation process;
the system decides when and what to ask of the user, and once it has obtained the
required information from the user, the system will autonomously generate its
translation, much like any other fully automatic MT system. In the best of
circumstances, the system will succeed in producing a grammatical and idiomatic
sentence in the target language which correctly preserves the meaning of the source
sentence. But even in this ideal situation, it would be mistaken to believe that this is the
only correct translation of the source sentence; for as every translator knows, almost
any source text admits of multiple, equally acceptable target language renditions. As
(King et al. 2003) put it:
“There is no one right translation of even a banal text, and even the same
translator will sometimes change his mind about what translation he prefers.”
(p.227)
What happens if the translation generated by the system does not correspond to the
one which the user had in mind? One of two things: either the user changes his/her
mind and accepts the machine’s proposal; or the user post-edits the system’s output,
changing it so that it conforms to the translation s/he intended. But in either case, it is
the user who is responding to, or following the system’s lead. In TransType, on the
other hand, it is entirely the other way round. The user guides the system by providing
prompts in the form of target text input, and the system reacts to those prompts by
trying to predict the rest of the translation which the user is aiming for. Moreover, the
system must adapt its predictions to changes in the user’s input. Here, quite clearly, it
is the user who leads and the system which must follow.
The technical evaluations focus on measuring improvements in the statistical translation
engines, and to do this the principal means employed are automatic metrics such as word
error rate and methods like BLEU.
usability evaluation, on the other hand, inescapably involves the intended end-users of
TransType, i.e. professional translators; and here, the goal is to evaluate, not so much
the performance of the system in vitro (as it were), but its actual impact on the
productivity of working translators and the ease (or difficulty) with which they adapt
to the system. An equally important objective of the user trials is to provide a channel
of communication through which the participating translators can furnish feedback
and suggestions to the system developers, so that the latter can continue to make
improvements to the system.4
We have just completed the third round of user evaluations in the TT2 project, and
the first in which the participating translators at Gamma and at Celer have actually
had the opportunity to work with the system in a mode that approximates their real
working conditions. In the following section, we present our objectives for this round
of user trials and the protocol which governed its organization.
4 To encourage users to provide such feedback, TT2 includes a pop-up notepad, with entries
that are automatically time-stamped and identified with the user’s name.
5 Currently, TT2 only accepts plain text files as input. By consulting the PDF original, the
participants could situate certain segments extracted from tables or graphics within their
proper context.
6 At the time of the ER3 trial, the RALI’s maximum entropy engine could not provide
completions longer than five words. ITI’s engine, which was the other system configured to
provide multiple predictions, was able to provide longer predictions, up to full-sentence
length.
Metric: Time for interactive translation.
Method: Measure the amount of time it takes to perform interactive translation on the test corpus.
Measurement: Amount of time for interactive translation on the test corpus.
7 In the context of its work on the original TransType project, the RALI did elaborate an
evaluation methodology specifically designed for interactive MT; see (Langlais et al. 2002).
Needless to say, we drew heavily on this experience.
8 Cf. http://www.issco.unige.ch/projects/isle/femti/
The first metric, notice, betrays a certain bias toward classic IMT systems; the tacit
assumption seems to be that the fewer times the system requires the user’s assistance,
the better. Our bias in target-text mediated IMT is quite different. What we want to
count is the number of times that the user accepts the system’s proposals in drafting
his/her translation; and in principle, the more often s/he does this, the better. As for
the second FEMTI metric, this is precisely the way we have adopted to measure our
participants’ productivity. The following table lists the parameters that TT-Player was
programmed to extract from the trace files on ER3. It also summarizes the results of
one of the participants on the “dry-run”, when the prediction engine was turned off,
and on a second session, with one of the two prediction engines turned on.
From the table, we see that by using the predictions provided by the RWTH
engine, the translator was actually able to increase her productivity from 14.4
words per minute on the dry-run to an impressive 18.9 words per minute on Chapter
M2_2. During this two-and-a-half-hour session, the system proposed 5961
completions, of which the translator accepted 6.4%.9 The average length of an
accepted completion was 4.3 words, and the average time required to accept a
completion was 11.4 seconds. The next six lines provide information on the manner in
which the user accepted the system’s proposals: in whole or in part, using the
keyboard or the mouse. The final four lines furnish various ratios between the length
of the participant’s target text (in words or in characters) and different types of
actions, e.g. the number of characters typed or deleted during the session. In the final
line, we see that when translating Chapter 2_4 on her own, the translator required an
average of 8.6 keystrokes or mouse clicks per word, whereas on Chapter M2_2, with
the benefit of the system’s predictions, the number of actions per word dropped to 3.8.
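Figures of this kind can be recomputed from a session trace with a few lines of arithmetic; the event format below is invented for illustration and does not reflect the actual TT2 trace-file format.

```python
# Sketch of computing productivity figures from a (hypothetical) session trace.
# Each event is (seconds_since_start, action, payload); the format is invented here.
def session_stats(events, words_produced):
    duration_min = events[-1][0] / 60.0
    proposals = sum(1 for _, a, _ in events if a == "proposal")
    accepted = [p for _, a, p in events if a == "accept"]
    keystrokes = sum(1 for _, a, _ in events if a in ("key", "click", "accept"))
    return {
        "words_per_minute": words_produced / duration_min,
        "acceptance_rate": len(accepted) / proposals if proposals else 0.0,
        "avg_accepted_length": (sum(len(p.split()) for p in accepted) / len(accepted)
                                if accepted else 0.0),
        "actions_per_word": keystrokes / words_produced,
    }

events = [(1, "key", "l"), (2, "proposal", "la maison"), (3, "accept", "la maison"),
          (60, "key", "."), (60, "proposal", "est"), (61, "key", " ")]
print(session_stats(events, words_produced=3))
```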
5 The Results
Before presenting a synthesis of the results we obtained on ER3, a number of caveats
are definitely in order. As we mentioned above, this was the first time that the
participants at Gamma and at Celer were actually translating with TT2 in a mode that
resembles their real working conditions; but “resembles” is the operative word here.
TransType remains a research prototype and as such its editing environment does not
offer all the facilities of a commercial word processor, e.g. automatic spell checking
or search-and-replace. Moreover, this was a very small-scale test, involving only four
texts and less than ten thousand words of translation. Hence, the results we present
below must be viewed as tentative. At least two other evaluation rounds are planned
before the end of the TT2 project, during which the participants will be asked to work
with the system for longer periods. Finally, there is another important caveat which
should cause us to be cautious, and it has to do with quality controls, of which there
were none in this round. During the preparatory sessions at the two agencies, the
participants were asked to produce translations of “deliverable” quality; but in fact,
we did nothing to ensure that this was the case, relying only on the translators’ sense
of professionalism. Hence, there was nothing to prevent one participant from rushing
through his/her translation, without attempting to reread or polish it, while another
might well invest significant time and effort in improving the quality of the final text,
even though this would have a negative impact on his/her productivity.
With these caveats in mind, let us now turn to the “bottom-line” quantitative results
on ER3. Assuming that the baseline figures provided on the dry-run are reliable, three
of the four participants succeeded in increasing their productivity on at least one of
the four texts they translated with TransType. If they were able to do so, it was largely
owing to the performance of certain of the prediction engines. In particular, the
participants at Celer were able to achieve impressive productivity gains using ITI’s
English-to-Spanish prediction engine. One translator at Celer more than doubled
his/her word-per-minute translation rate using the ITI engine on one of the texts; on
9 This number of predictions may appear at first to be very high; but then it must be
remembered that TransType revises its predictions with each new character the user enters.
another text, the second Celer translator logged the highest productivity of all the
participants on the trial, again using ITI’s English-to-Spanish prediction engine.
However, when we examine more closely the manner in which some of the
participants actually used the completions proposed by TT2, there is somewhat less
cause for jubilation. It seems quite clear that in certain sessions the translators opted
for a strategy that is closer in spirit to classic MT post-editing than to interactive
MT. Instead of progressively guiding the system via prompts toward the translation
they had in mind, the users would often accept the first sentence-length prediction
in its entirety and then edit it to ensure its correctness. That the participants were able
to increase their productivity by post-editing in this manner certainly speaks well for
the translation engines involved. However, our fundamental goal in the TT2 project is
to explore the viability of interactive machine translation, and this strategy which
certain participants adopted – and which is confirmed, incidentally, by replaying the
sessions in TT-Player – cannot really be viewed as true IMT.
Still, in our research project as elsewhere, the customer is always right. If the
participants at Celer and at Gamma did not make more extensive use of the system’s
interactive features, it can only be because they felt it was not useful or productive (or
perhaps too demanding) for them to do so. Thus, the challenge for the engine
developers in the remainder of the TT2 project is to enhance the system’s interactive
features so that the users will freely choose to exploit them to greater advantage.
In addition to the translations they produced, the participants also provided us with
a number of insightful comments about their experience in working with TT2. One
user told us, for example, that five alternate completions may be too many,
particularly when the differences between them are minimal. The participants also
pointed out a number of irritants in the GUI, e.g. the fact that the text does not
automatically scroll up when the user reaches the bottom-most segment on the screen;
or the occasional incorrect handling of capitalization and spacing around punctuation
marks. Although none of these are major, they are a source of frustration for the users
and do cause them to lose production time. Finally, the trial appears to validate the
decision to base a large part of our usability evaluations on the automatic analysis of
the trace files generated in each translation session. Not only is TT-Player able to
produce a detailed statistical analysis of each session; it also allows us to verify
certain hypotheses by replaying the session, as though we were actually present and
looking over the translator’s shoulder.
6 Conclusions
It remains to be seen whether fully interactive, target-text mediated MT like that
offered by TransType will prove to be a productive and a desirable option for
professional translators who are called on to produce high-quality texts. The TT2
project still has more than a year to run and there are many improvements we plan to
implement and many avenues that we have yet to explore. One thing is already
certain, however, and that concerns the essential role that end-users can play in
orienting an applied research project like this one. The translators at Gamma and at
Celer have already made important contributions to TT2, both in preparing the
system’s functional specifications and in helping to design its graphical user interface.
And through their participation in the remaining evaluation rounds, it is they who will
have the last word in deciding whether this novel and intriguing approach to
interactive MT is worth pursuing.
References
Blanchon, H., and Boitet, C.: Deux premières étapes vers les documents auto-explicatifs. In:
Actes de TALN 2004, Fès, Morocco (2004) pp. 61-70
Foster, G., Langlais, P., Lapalme, G.: User-Friendly Text Prediction for Translators. In:
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing
(EMNLP), Philadelphia (2002) pp. 148-155
Hutchins, W. John: Machine Translation: Past, Present, Future. Ellis Horwood Limited,
Chichester, United Kingdom (1986)
King, M., Popescu-Belis, A., Hovy, E.: FEMTI: creating and using a framework for MT
evaluation. In: Proceedings of MT Summit IX, New Orleans (2003) pp. 224-231
Langlais, P., Lapalme, G., Loranger, M.: TRANSTYPE: Development–Evaluation Cycles to
Boost Translator’s Productivity. Machine Translation 17 (2002) pp. 77-98
Nyberg, E., Mitamura, T., Carbonell, J.: The KANT Machine Translation System: From R&D
to Initial Deployment. LISA Workshop on Integrating Advanced Translation Technology,
Washington D.C., (1997)
Wehrli, E.: Vers un système de traduction interactif. In Bouillon, P., Clas, A. (eds.): La
Traductique. Les presses de l’Université de Montréal, AUPELF/UREF (1993) pp. 423-432
An Experiment on Japanese-Uighur Machine
Translation and Its Evaluation
1 Introduction
Machine translation (MT) has been a very challenging field in the area of natu-
ral language processing for many years. The very early approaches were largely
unsuccessful, not only for lack of computing resources and/or machine-readable
language resources, but also because the complexity of the interaction effects
in natural language phenomena had been underestimated. At the same time,
machine translation is an applied area which benefits from advances in theoretical
artificial intelligence (AI) and natural language processing (NLP) – and in
spite of the partial results which have been achieved, we are still far from a sat-
isfying treatment of natural language. Recently, computing resources no longer
seem to be a critical problem, and machine-readable language resources like
corpora have become available (for some major languages like English, Japanese,
Spanish, and so on). Consequently, approaches that avoid the need for massive
knowledge acquisition by rejecting the established NLP paradigm in favor of
knowledge-free, linguistics- and AI-independent techniques have become a common
approach to MT. Examples are statistical MT [1], which carries out translation based
on co-occurrence and distribution probability calculations over very large aligned
bilingual text corpora, and example-based MT [2], which stores translation examples
in a database and then matches the input sentence against these examples to find the
best matching translation example.
Japanese and Uighur have many syntactic and structural similarities, including
word order, the existence and identical functions of case suffixes and verbal suffixes1,
morphological structure, and so on. More importantly, syntactic structure in both
languages depends on case suffixes and verbal suffixes. For these reasons, we suggest
that Japanese can be translated into Uighur by word-by-word alignment after
morphological analysis of the input sentences, without complicated syntactic analysis;
that is, instead of performing syntactic analysis, we can align an appropriate Uighur
case suffix with each Japanese case suffix in the input sentence, and we can thus avoid
the need for massive knowledge acquisition.
However, from the point of view of traditional Japanese grammar, there is some
divergence between the two languages in the treatment of verbal suffixes. To resolve
this divergence, we adopt a Japanese-Uighur machine translation approach [3,4]
based on derivational grammar [5].
On the other hand, there are few one-to-one correspondences between Japanese
and Uighur nominal suffixes, especially the case suffixes that specify the role of
noun phrases in a sentence. To resolve this problem, we adopt the case-pattern-based
approach [6,7], exploiting the common property that nominal suffixes depend on
verbal phrases in both languages.
In the process of Japanese-Uighur machine translation, we also have to resolve
the phonetic change problem caused by the vowel harmony phenomena of Uighur.
For this purpose, we formalized phonetic change rules and built a high-precision
morpheme synthesizing system.
Furthermore, we incorporated a Japanese-Uighur dictionary [8,9] of about 20,000
words into our machine translation system. We have thus built a nearly practical
Japanese-Uighur machine translation system consisting of verbal suffix processing,
case suffix processing, phonetic change processing, and a Japanese-Uighur dictionary.
From the point of view of practical usage, we chose three articles about environmental
issues that appeared in the Nihon Keizai Shimbun and conducted a translation
experiment on these articles with our system. In this experiment, a precision of 84.8%
was achieved, where the correctness of the phrases in the output sentences was the
evaluation criterion.
1 These are also called case particles and auxiliary verbs, respectively, in traditional Japanese
grammar.
This is because most sentences in practical use are long, and fewer occurrences of
incorrect phrases in the output sentences are highly desirable in the translation
process, especially when a machine translation system acts as a computer-aided
translation system.
In this paper, we illustrate the similarities between Japanese and Uighur and point
out the problems that must be resolved in the machine translation process. We then
describe an implementation of our machine translation system based on derivational
grammar and the case suffix replacement approach. Finally, we report an experiment
and evaluation on the system to show the validity of our argument that a high-precision
Japanese-Uighur machine translation system can be achieved without syntactic
analysis.
Note that in this paper we transcribe Uighur in a Roman-based alphabet, although
modern Uighur is normally written in an Arabic-based alphabet, as shown in Table 1.
The same situation can be found in the Uighur language. Here, the Uighur case
suffixes ‘-ø’ (nominative case) and ‘-ni’ (accusative case) correspond respectively to
the Japanese case suffixes ‘-ga’ and ‘-wo’. In Uighur, the nominative case is indicated
by the zero form, which we write as ‘ø’. We can therefore say the same thing about
word order: the Uighur case suffixes make it possible to change the word order in a
sentence without changing its meaning.
Thus, we can translate both of the Japanese sentences “karega tobirawo aketa”
and “tobirawo karega aketa” into the Uighur sentences “Uø ixikni aqti.” and “Ixikni
uø aqti.”, respectively, by word-by-word alignment.
This observation shows that the case suffixes play essential roles in the syntactic
structures of Japanese and Uighur, and that Japanese-Uighur machine translation can
be achieved without complicated syntactic analysis. Detailed observations can be
found in [6] and [7].
4 Suffix Adjustment
Let us now proceed to the realization of our word-by-word translation from Japanese
to Uighur. The facts we have presented so far show that the problem to be resolved is
how to decide verbal and case suffix correspondences correctly. To overcome this
problem, we adopt a method that assigns a default Uighur suffix to each Japanese
suffix and then substitutes a better-fitting suffix for an unnatural one according to
replacement rules. Since a verbal stem and the following verbal suffix affect each
other, we can choose an appropriate verbal suffix by inspecting the words to its left
and right. On the other hand, the verb on which a nominal with a case suffix depends
affects that case suffix, so we need to decide the correct case suffix by considering the
governing verb. For details on this subject, see [7] and [4].
Fig. 2. A sample Japanese passage from the first article used in our translation
experiment. From the point of view of practical usage, we chose three articles about
environmental issues that appeared in the Nihon Keizai Shimbun and conducted a
full-text translation experiment on the articles with our MT system
Secondly, replacement rules are applied to those Uighur verbal suffixes that match
the conditions in the replacement table.
Thirdly, case suffixes are replaced if the verbs they depend on have replacement
rules whose conditions are satisfied. In the example, the first step translates ‘-wo’
into ‘-ni’. But the verb ‘watar’, on which the noun phrase “hashiwo” depends, has the
replacement rule (watar, consonant verb, öt{wo/din}), so ‘-ni’ is replaced with ‘-din’.
Finally, the morpheme synthesizing system synthesizes the Uighur morphemes
according to the personal-suffix context and the phonetic change rules, and generates
the Uighur output sentence.
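A toy rendering of the case-suffix adjustment is given below; the rule tuple mirrors the (watar, consonant verb, öt{wo/din}) example, while the default table and function name are invented for illustration.

```python
# Sketch of case-suffix adjustment: assign the default Uighur suffix for each
# Japanese case suffix, then replace it if the governing verb has a rule for it.
default_suffix = {"ga": "ø", "wo": "ni"}   # toy defaults; "ø" is the Uighur zero form

# (Japanese verb stem, verb class, Uighur verb, {Japanese case suffix: replacement})
replacement_rules = [("watar", "consonant verb", "öt", {"wo": "din"})]

def adjust_case_suffix(jp_suffix, governing_verb):
    uighur_suffix = default_suffix.get(jp_suffix, jp_suffix)
    for verb, _vclass, _uig_verb, repl in replacement_rules:
        if verb == governing_verb and jp_suffix in repl:
            return repl[jp_suffix]        # verb-specific replacement overrides the default
    return uighur_suffix

print(adjust_case_suffix("wo", "watar"))  # 'din'  (hashi-wo watar -> ...-din öt)
print(adjust_case_suffix("wo", "ake"))    # 'ni'   (default; no rule for 'ake')
```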
6 Evaluation Experiment
From the point of view of practical usage, we chose three articles about environmental
issues that appeared in the Nihon Keizai Shimbun and conducted a translation
experiment on the articles with our system, which includes a Japanese-Uighur
dictionary of about 20,000 words. The articles comprise 136 Japanese sentences and
include 306 verbal phrases (254 different patterns). We show a portion of these
sentences and their translations in Figures 2, 3, and 4.
We compared each phrase in the system-translated sentences against the
corresponding phrase in sentences translated by a native Uighur speaker.
In this paper, we have illustrated the similarities between Japanese and Uighur and
pointed out the problems that must be resolved in the machine translation process.
We then described the basic structure of our machine translation system based on
derivational grammar and the case suffix replacement approach. The translation
system has succeeded in systematic word-by-word translation. In addition, it can
generate nearly natural Uighur sentences by using replacement rules. We can
therefore say that, by utilizing the similarities between Japanese and Uighur, a
high-precision machine translation system can be achieved without complicated
syntactic analysis.
References
1. Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R.L.
and P.S. Roossin: A statistical approach to language translation. Computational
Linguistics. vol 16, (1990) 79–85
2. Nagao, M.: A framework of a mechanical translation between Japanese and English
by analogy principle. In: A. Elithorn and R. Banerji (eds.) Artificial and Human
Intelligence. NATO Publications(1984).
3. Ogawa, Y., Muhtar M., Sugino, K., Toyama, K. and Inagaki, Y.: Generation of
Uighur Verbal Phrases Based on Derivational Grammar. Journal of Natural Lan-
guage Processing(in Japanese), Vol.7, No.3 (2000) 57–77
4. Muhtar, M., Ogawa, Y., Sugino, K. and Inagaki, Y.: Utilizing Agglutinative Fea-
tures in Japanese-Uighur Machine Translation. In: Proceedings of the MT Summit
VIII, Santiago de Compostela, Galicia, Spain (2001) 217–222
5. Kiyose, G. N.: Japanese grammar –A new approach–. Kyoto University Press (1995)
6. Muhtar, M., Casablanca, F., Toyama, K. and Inagaki, Y.: Particle-Based Machine
Translation for Altaic Languages: the Japanese-Uighur Case. In: Proceedings of the
3rd Pacific Rim International Conference on Artificial Intelligence, Vol.2. Beijing,
China (1994) 725–731
7. Muhtar, M., Ogawa, Y. and Inagaki, Y.: Translation of Case suffixes on Japanese-
Uighur Machine Translation. Journal of Natural Language Processing(in Japanese),
Vol.8, No.3 (2001) 123–142
8. Muhtar, M., Ogawa, Y., Sugino, K. and Inagaki, Y.: Semiautomatic Generation
of Japanese-Uighur Dictionary and Its Evaluation. Journal of Natural Language
Processing(in Japanese), Vol.10, No.4 (2003) 83–108
9. Muhtar, M., Ogawa, Y., Sugino, K. and Inagaki, Y.: Semiautomatic Generation of
Japanese-Uighur Dictionary and Its Evaluation. In: Proceedings of the 4th Work-
shop On Asian Language Resources (2004) 103–110
A Structurally Diverse Minimal Corpus for
Eliciting Structural Mappings Between
Languages*
information about this language. However, we do not simply elicit any data: A
naturally occurring corpus exhibits all the structure and feature phenomena of a
language; the disadvantage of using a naturally occurring corpus for elicitation
is that it is highly redundant, especially in the most frequent phenomena (e.g.
DET ADJ N). An elicitation corpus, on the other hand, is a condensed set
of sentences, where each sentence pinpoints a different phenomenon of interest.
Because the elicitation will contain some noise, a certain amount of redundancy
should be built into the corpus.
In our previous work on elicitation, we focused mainly on feature in-
formation [6], [8]. The resulting corpus contains sequences such as He has
one daughter, He has one son, He has two daughters, and He has two
sons. Clearly, such a sequence of sentences will be crucial in detecting the
marking of gender, number, case, and other features. It will however be redun-
dant if the goal is to learn how the structure VP is expressed in the target
language. Redundancy in itself is not necessarily detrimental and can actually
help in learning exceptions to general rules. However, redundancy will come at
the expense of diversity. Precisely because in feature elicitation it is important to
hold as many other factors as possible constant (i.e. elicit gender differences by
comparing two sentences that differ only in gender), a very different elicitation
corpus is needed when eliciting structures. For this reason, the most important
design criterion for the corpus described in this paper is that it should be di-
verse by including a wide variety of structural phenomena. Further, it should be
sufficiently small for a bilingual user to translate it within a matter of hours.
As was noted before [8], one inherent challenge of automatic language elici-
tation is a bias towards the source language (SL). The user is simply presented
with a set of sentences to translate. Phenomena that might occur only in the
TL are not easily captured without explanations, pictures, or the like. In par-
allel work, we are addressing the issue; meanwhile, we must be aware of this
elicitation bias. We handle it in our rule learning module by only conservatively
proposing TL structures. Structures that are not mirrored in the SL are not
learned explicitly. For more details on the rule learning, refer to [7].
full sentences. In our experiment, the enhanced Penn Treebank training corpus
contains 980,120 sentences and phrases. Elicited examples of the different types are
used by the rule learner to learn transfer rules for different structures, so that the rules
can compositionally combine to cover larger input chunks.
The next step is to represent each parse by a meaningful identifier. This is
done in order to determine how many different structures are present in the
corpus, and how often each of these structures occurs. Consider the following
two example sentences: The jury talked. and Robert Snodgrass , state GOP
chairman , said a meeting held Tuesday night in Blue Ridge brought
enthusiastic responses from the audience.
We can see that they essentially instantiate the same high-level structure of
a sentence, namely NP VP. For this reason, we chose to represent each parse
as an instance of its component sequence, which describes the parse’s general
pattern. Since a pattern is always uniquely represented by a component sequence,
the two terms are used essentially interchangeably in the remainder of the paper.
The component sequence of a parse is defined as a context-free rule whose left-
hand side is the label of the parse’s top node and whose right-hand side is the
series of node labels one level down from the parse’s top node, NP VP in the
examples above.
The resulting training corpus contains a list of sentences and phrases, to-
gether with their parses and corresponding component sequences. We then cre-
ate a list of all the unique patterns (component sequences) encountered in the
training data and a count of how many times each sequence occurred. The
sequences are sorted by type, i.e. by the label of the parse’s top node. The elic-
itation corpus we want to create should contain example patterns from each
type. We chose to focus on the following types: ADVP, ADJP, NP, PP, SBAR,
and finally S, as we believe these to be stable constituents for transfer between
many language pairs. Future work may address other types of structures, such
as VP or WHNP (e.g. ‘in what’). Some patterns occur frequently, whereas oth-
ers are rarely encountered. For example, NPs can exhibit different patterns, some
of which (e.g. a bare N) are very frequent, while others (e.g. those containing an
ADJP N N sequence) occur much less frequently. Table 1 shows the five most
frequent patterns for each type, together with their frequency of occurrence in the
training corpus.
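Reading a component sequence off a parse and counting pattern frequencies is straightforward; the tiny tuple-based parse representation below is invented for illustration.

```python
from collections import Counter

# Sketch: a parse is (label, children), where children are sub-parses or words.
# The component sequence is "TOP -> child labels one level down".
def component_sequence(parse):
    label, children = parse
    kids = [c[0] if isinstance(c, tuple) else "WORD" for c in children]
    return label + " -> " + " ".join(kids)

parses = [
    ("S", [("NP", ["the", "jury"]), ("VP", ["talked"])]),
    ("S", [("NP", ["Robert", "Snodgrass"]), ("VP", ["said", "..."])]),
    ("NP", [("DET", ["the"]), ("N", ["jury"])]),
]
pattern_counts = Counter(component_sequence(p) for p in parses)
print(pattern_counts)   # Counter({'S -> NP VP': 2, 'NP -> DET N': 1})
```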
In order to maximize the time effectiveness of the bilingual speaker who will
translate the corpus, we wish to focus on those patterns that occur frequently.
At the same time, we would like to know that we have covered most of the
probability mass of the different patterns of a given type. We chose to use the
following method: for each pattern, we plot a graph depicting the cumulative
probability with the addition of each pattern. An example of such a graph can
be seen in Figure 1 below. The y-axis in this graph is the cumulative probability
covered (i.e. what portion of the occurences of this type in the training corpus are
covered), and the x-axis is the cumulative number of patterns. In other words,
the highest-ranking NP pattern accounts for about 17.5% of all occurrences
of NPs in the training data; the highest and second highest ranking patterns
together account for about 30% of NPs. We then linearly interpolate the data
points in the graph, and choose as a cutoff the relative addition of probability
mass by a pattern: We compute for each pattern the amount of probability that
is added when adding a pattern. This can be computed by where
is the number of times pattern occurred in the training data and is
the number of unique patterns for a specific type. We include in the corpus all
patterns whose falls above a threshold. For the experiments presented here,
we chose this threshold at 0.001. This allows us to capture most of the relevant
structures for each type, while excluding most idiosyncratic ones. For instance,
the lowest-ranking NP pattern included in the corpus still occurred more than
300 times in the original corpus.
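A minimal sketch of this selection step, assuming the component sequences of one type have already been extracted (the 0.001 threshold is the one reported above; the example data are invented):

from collections import Counter

def select_patterns(sequences, threshold=0.001):
    counts = Counter(sequences)              # component sequence -> frequency
    total = sum(counts.values())
    selected, cumulative = [], 0.0
    for pattern, count in counts.most_common():
        delta = count / total                # relative addition of probability mass
        cumulative += delta
        if delta >= threshold:
            selected.append((pattern, count, round(cumulative, 3)))
    return selected

nps = ["NP -> DT NN"] * 500 + ["NP -> NN"] * 300 + ["NP -> ADJP NN NN"] * 2
print(select_patterns(nps))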
For each of the patterns that is to be included in the elicitation corpus,
we would like to find an example that is both representative and as simple
as possible. For instance, consider again the two sentences presented above:
The jury talked. and Robert Snodgrass , state GOP chairman , said
a meeting held Tuesday night in Blue Ridge brought enthusiastic
responses from the audience. Clearly, the first sentence could be a useful
elicitation sentence, while the second sentence introduces much more room
for error: a number of reasons (such as lexical choice, the complex SBAR
structure, etc.) could prevent the user from translating this sentence into a
similar structure. We therefore would like to create a corpus with representative
yet simple examples. In order to automate this process somewhat, we extract
for each pattern one of the instantiations with the fewest parse
nodes. This heuristic can help create a full elicitation corpus automatically.
It is however advisable to hand-inspect each of the automatically extracted
examples, for a variety of reasons. For instance, the automatic selection process
cannot pay attention to lexical selection, resulting in sentences that contain
violent or otherwise inappropriate vocabulary. It can also happen that the
automatically chosen example is in fact an idiomatic expression and would
not easily transfer into a TL, or that the structure, taken out of context,
is ambiguous. Some patterns are also not appropriate for elicitation, as they
are idiosyncratic to English, e.g. determiners can make up NPs (e.g. the ‘this’
in This was a nice dinner), but this does not necessarily hold for other
languages. Finally, the Penn Treebank contains some questionable parses. In all
these problematic cases, we manually select a more suitable instantiation for
the pattern, or eliminate the pattern altogether.
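The automatic part of this heuristic amounts to taking, for each pattern, the instantiation with the fewest parse nodes; a sketch follows (function names are illustrative, and the manual review described above still applies afterwards):

from nltk import Tree

def node_count(tree):
    # Count the internal nodes of a parse.
    return 1 + sum(node_count(child) for child in tree if isinstance(child, Tree))

def simplest_instances(parses_by_pattern):
    # parses_by_pattern: component sequence -> list of candidate parse Trees
    return {pattern: min(trees, key=node_count)
            for pattern, trees in parses_by_pattern.items()}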
The resulting corpus contains 222 structures: 25 AdvPs, 47 AdjPs, 64 NPs, 13
PPs, 23 SBARs, and 50 Ss. Some examples of elicitation sentences and phrases
can be seen in the listing below. The examples are depicted here together with
their parses and component sequences. The bilingual user who will translate
the sentences is only optionally presented with the parses and/or component
sequences. Since not all bilingual users of our system can be expected to be
trained in linguistics, it may be appropriate to present them simply with the
phrase or sentence to translate. Other context information, such as the parse as
well as the complete sentence (if eliciting a phrase), can be provided. This can
help the user to disambiguate the phrase if necessary.
3 Multiple Corpora
In this section, we argue that an elicitation corpus as small as the one we describe
can be useful without losing important information. This is shown by creating
an increasingly redundant corpus and observing that the information gained
converges as redundancy increases, as described below.
One common problem is that lexical selection in the elicitation language can
lead to unexpected or non-standard translations. For example, when eliciting
the pattern ADJ N with a TL such as Spanish or French, the adjective will occur
either before or after the noun in the TL, depending on the particular adjective
in the example. The ultimate goal of elicitation is to learn both the general rule (i.e.
adjective after the noun) as well as the exceptions; it is however more important
that we not miss the more general rule. This would happen if the elicitation
instance contained an adjective that represents an exception. Redundancy in
the corpus can serve as a safeguard against this issue. We have therefore created
three corpora, each of which contains different examples for the same list of SL
patterns. Whenever possible, the structure of the training example was slightly
altered; the high-level structure, i.e. the component sequence, however, always
remained the same. For instance, each of the corpora contains an example of the
high-level structure (type and component sequence) N PP.
For evaluation purposes, we have translated the three corpora into German
and have word-aligned the parallel phrases and sentences by hand. Two of the
corpora were also translated into Hebrew by an informant. Below, we evaluate
our structural elicitation corpus based on the translations obtained for German
and Hebrew.
In this case, the S ‘he would not sleep’ is a component, so that SL in-
dices 2-5 together form a component. The aligned TL indices form a set with
no alignments to any other component (other than the SL S), so that it can be
postulated that they form a TL component. However, things are not always this
simple. For instance, it can happen that there are 0-1 or 1-0 word alignments,
in which case component correspondences cannot be read off the alignments as directly.
Each corpus contains 222 SL patterns. When adding a second corpus for Ger-
man, we obtained an additional 52 patterns. The addition of the third corpus
resulted in only an additional 25 patterns. For Hebrew, we only have transla-
tions for two of the corpora available. It was found that in the first corpus, 209
unique component sequences and alignments were elicited. Some patterns (15)
were not translated; others (10) had more than one translation, while not all
translations resulted in different component analyses. The second corpus added
55 new patterns with unique component alignments, and an additional 15 that
were not translated in the first corpus. In the second corpus, 9 patterns were not
translated. It can be seen from the results in the table below that the addition
of corpora does add patterns that had not been observed before. However, each
additional corpus adds less to the number of patterns observed, as expected. The
main conclusion is that the number of additional patterns drops off very quickly.
With the third German corpus, only 11% of the patterns in the corpus result in a
component sequence that was previously unobserved. This leads us to argue that
we have good evidence that in the case of German, the most common structure
mappings appear to be covered already by the first two corpora. The addition
of a third corpus adds additional redundancy and protection from information
loss. Hebrew is a more difficult case for elicitation, so that a third (and maybe
fourth) corpus appears to be advisable.
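The redundancy analysis reported above reduces to counting how many previously unseen mappings each additional translated corpus contributes; the sketch below is illustrative only (the elicited pairs would come from the hand-aligned translations):

def incremental_patterns(corpora):
    # corpora: sets of elicited (SL pattern, TL component sequence) pairs,
    # one set per translated corpus, in the order the corpora are added.
    seen, gains = set(), []
    for corpus in corpora:
        gains.append(len(corpus - seen))
        seen |= corpus
    return gains

# For the German corpora this yields counts of the form [222, 52, 25].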
The rules that are learned are also sensitive to lexical choices. Thus some of the
rules contain lexical items and are not as general as they would be if a human
grammar writer had designed them. This means that we will often learn different
rules for the same pattern, even if the component alignment as described above is
the same. This provides a measure of safeguarding in the training corpus. In order
to determine how effective our
elicitation corpus is for learning rules, we trained our system on the three German
and two Hebrew corpora separately and measured how many unique rules are
learned in each case. The results can be seen in the table below.
As expected, there is more overlap in the rules learned for German between
the different corpora, because English and German are more closely related than
English and Hebrew. In particular, it was observed that many words in the
English-Hebrew rules are left lexicalized due to word alignments that were not
one-to-one. This again leads us to conclude that it would be useful to obtain
additional data.
Some examples of learned rules can be seen below. As was mentioned above,
the rules are learned for different types, so that they can combine compositionally
at run time.
References
1. Bouquiaux, Luc and J.M.C. Thomas. Studying and Describing Unwritten Lan-
guages, The Summer Institute of Linguistics, Dallas, TX, 1992.
2. Comrie, Bernard and N. Smith. Lingua Descriptive Series: Questionnaire, Lingua,
42, 1-72, 1977.
3. Lavie, Alon, S. Vogel, L. Levin, E. Peterson, K. Probst, A. Font Llitjos, R.
Reynolds, J. Carbonell, R. Cohen. Experiments with a Hindi-to-English Transfer-
based MT System under a Miserly Data Scenario, ACM Transactions on Asian
Language Information Processing (TALIP), 2:2, 2003.
4. Jones, Douglas and R. Havrilla. Twisted Pair Grammar: Support for Rapid Devel-
opment of Machine Translation for Low Density Languages, Third Conference of
the Association for Machine Translation in the Americas (AMTA-98), 1998.
5. Marcus, Mitchell, A. Taylor, R. MacIntyre, A. Bies, C. Cooper, M. Fergu-
son, A. Littmann. The Penn Treebank Project, http://www.cis.upenn.edu/~treebank/home.html, 1992.
6. Probst, Katharina, R. Brown, J. Carbonell, A. Lavie, L. Levin, E. Peterson. Design
and Implementation of Controlled Elicitation for Machine Translation of Low-
density Languages, Workshop MT2010 at Machine Translation Summit VIII, 2001.
7. Probst, Katharina, L. Levin, E. Peterson, A. Lavie, J. Carbonell. MT for Resource-
Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules, Ma-
chine Translation, Special Issue on Embedded MT, 2003.
8. Probst, Katharina and L. Levin. Challenges in Automated Elicitation of a Con-
trolled Bilingual Corpus, 9th International Conference on Theoretical and Method-
ological Issues in Machine Translation (TMI-02), 2002.
9. Sheremetyeva, Svetlana and S. Nirenburg. Towards a Universal Tool for NLP Re-
source Acquisition, Second International Conference on Language Resources and
Evaluation (LREC-00), 2000.
Investigation of Intelligibility Judgments
Florence Reeder
1 Introduction
Intelligibility is often measured through judgment on a one to five scale where five
indicates that the translation is as coherent and intelligible as if it were authored in the
target language. Varied definitions of intelligibility exist (e.g., [1], [11], [13]). The
definition provided through the ISLE work [6] reflects historical definitions and is
used here. Intelligibility is the “extent to which a sentence reads naturally.”
Naturalness of reading can be understood as the degree of conformance to the target
language. Since the target language in this experiment is English, the test can be said
to measure the degree of Englishness.
In language teaching, Meara and Babi [7] tested assessors’ abilities to make a
distinction between Spanish native speakers (L1) and language learners (L2) for
written essays. The experiment’s main premise was that professional language
examiners tend to make snap judgments on student exams. They showed assessors
essays one word at a time, asking them to identify the author as L1 or L2. They
counted the number of words before an attribution was made as well as success rates
for the assessors. This is an intelligibility or coherence test since the essay topics
were open-ended.
For eighteen assessors and 180 texts, assessors could accurately attribute L1 texts
83.9% and L2 texts 87.2% of the time. Additionally, they found that assessors could
make the L1/L2 distinction in less than 100 words. It took longer to successfully
attribute the author’s language when the essay was written by a native speaker (53.9
words) than by a language learner (26.7 words). This means that the more intelligible
an essay was, the harder it was to recognize. They ascribe this to the notion that L1
writing “can only be identified negatively by the absence of errors, or the absence of
awkward writing” [7]. While they did not select features that evaluators used, they
hypothesized a “tolerance threshold” for low quality writing. Once the threshold had
been reached through either a major error or several missteps, the assessor could
confidently attribute the text. The results are intuitive in that one would expect the
presence of errors to be a strong determiner of intelligibility. On the other hand, it
represents a measurement of the intuition, and the number of words needed is
surprisingly small given the size of the average exam (over 400 words). The question
arises, “Can the test be adapted for human translation (HT) and machine translation
(MT) output?”
2 Experimental Setup
The data selected had been used in previous MT evaluations [13], representing
previously scored data. The DARPA-94 corpus consists of three language pairs,
French-English, Spanish-English and Japanese-English, containing one hundred
documents per language pair taken from news texts. The texts are roughly 400 words
apiece. For each source language document, two reference translations accompany the
MT outputs. DARPA-94 is well-suited here because it has human intelligibility
judgments. In the selected Spanish-English collection, five MT systems participated.
The first fifty available translations for each system were chosen as well as one of two
reference translations.1 Headlines were removed as they represent atypical language
usage. The extracts were taken from the start of each article. Sentence boundaries
were preserved; therefore, each text was between 98 and 140 words long. Each text
fragment was placed on a separate page along with an identifier and a scoring area.
Fifty participants were recruited, of whom 98% were native English speakers and 2%
had native English fluency. Participants were divided between those at least familiar
with Spanish at 56%, and those with no Spanish at 44%. Subjects were heavily
weighted towards computer competency, with 84% being computing professionals and
none lacking computing competence. MT knowledge was also spread between those with
some familiarity at 38%; those with some NLP or language capability at 40%; and
those with no familiarity, 22%.
Each subject was given a set of six extracts which were a mix of MT and HT. No
article was duplicated within a test set. Text order was mixed within a test set so that
the system and HT orders varied. Participants were given test sheets with a blank
piece of paper for overlay. They were instructed to use the overlay to read the article
line by line and were told to circle the word at which they made their decision, as soon as they
could differentiate between HT and MT. Half of the subjects were given no
information about the human’s expertise level, while the other half were told that the
HT was done by a professional translator. To facilitate snap judgment, subjects were
given up to three minutes per text, although few came close to the limit.
1 For one of the systems, nearly twenty documents were missing at the time of the experiment.
These twenty were at the end of the selected sample and were augmented from the next
group of 20.
Subjects were able to correctly attribute the translations 87.7% of the time. This
determination is slightly above that reported for the L1/L2 distinction which averaged
85.6% [7]. For MT systems, the recognition rate was 87.6% which is comparable to
the L2 attribution of 87.2% [7]. Surprisingly, at 88%, the successful attribution of HT
was higher than the native speaker attribution rate of 83.9% [7]. Therefore, HT can
be differentiated from MT in less than one hundred words.
In looking at the details of the scores, three of the fifty test subjects correctly
identified only 50% of the texts. Another three subjects missed two of the texts,
leaving 88% able to differentiate with at most one misattribution. Of this 88%, the
group was evenly split between one error and no errors. Post-test interviews showed
that one judge assumed pathological cases, such as a non-native English speaker
translating Spanish to English, and another had little computer experience and
considered computers more capable than humans.
The assumption is that, given the high rate of computer-savvy personnel in testing, a
system producing less intelligible output or lower fluency scores would be more
accurately attributed as computer generated. This is measured by comparing the
attribution accuracy (percent correct) with the intelligibility (fluency)
score (Table 1). Systems with higher intelligibility scores are less accurately
differentiated from HT. As the fluency increases for systems, the percentage of
correctly attributed texts decreases. While this can be seen (Figure 1), analysis shows
a Pearson correlation of R = -0.89 (significant at 0.05). The negative correlation
reflects the fact that as fluency increases differentiation from HT decreases. The
Spearman correlation of -1 (significant at 0.01) is even stronger.
Given that the attributions were done on only part of the texts, we also looked at
the fluency scores assigned for those parts of the texts that were part of the test. The
original DARPA-94 scores were aggregates of individual sentence scores. Subjects
were shown a text one sentence at a time and asked to score that sentence on a 1-5
scale where a score of five represents the greatest fluency. For a given text, scores
were then averaged and divided by five resulting in a score which fell between zero
and one. To look at partial text fluency, we took the scores for individual sentences
used in the test, averaged them and divided by five. These are shown (third column of
Table 1). When looking at the score correlations to partial text fluency, the result is
close with a Pearson correlation of R = -0.85 (significant at 0.07). Partial fluency
scores correlate with overall fluency scores at R = 0.995.
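A sketch of these calculations follows; the per-system values are placeholders rather than the DARPA-94 data, and the function name is ours:

from scipy.stats import pearsonr, spearmanr

def normalized_fluency(sentence_scores):
    # Average the 1-5 sentence judgments and divide by five, as described above.
    return (sum(sentence_scores) / len(sentence_scores)) / 5.0

# Placeholder per-system values: attribution accuracy vs. normalized fluency.
accuracy = [0.95, 0.92, 0.88, 0.85, 0.78]
fluency = [normalized_fluency(s) for s in
           [[2, 2, 3], [2, 3, 3], [3, 3, 3], [3, 3, 4], [3, 4, 4]]]

print(pearsonr(accuracy, fluency))   # expect a strong negative correlation
print(spearmanr(accuracy, fluency))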
While the percentage correct does correlate with fluency for MT systems, as with
other metrics such as BLEU [8], introducing HT causes the correlation to degrade.
When HT is added in, scores do not follow the trend of lower intelligibility
conforming to higher judgment accuracy with a correlation of -0.45 (significant at
0.32). This is because the HT examples have significantly higher fluency scores than
MT, yet are able to be distinguished from MT.
include, the correlation improves to 0.95 (significant at 0.004). This confirms the idea
that the presence of errors contributes significantly to the distinction. Looking at
partial fluency and the percentage of words (as opposed to the number of words)
improves the scores to some extent (for systems only).
Having established that the snap judgment test correlates well with intelligibility, it
remains to determine if there are error types which play a large part in the decision
process (Table 3). For this stage of the effort, only correctly attributed MT output is
analyzed. This decision is due to the fact that MT, rather than HT, is of interest.
Only correct attributions were included because these would provide useful
information for training a system to correctly attribute MT as well.
Some error types are immediately apparent. Not translated words, other than proper
nouns, were generally immediate clues to the fact that a system produced the results.
Other factors include incorrect pronoun translation such as multiple pronouns output,
inconsistent preposition translation, and incorrect punctuation. To determine the
errors and their relative contribution, the results were annotated for certain broad error
categories. The problem of error attribution is well-known in both the language
teaching community (e.g., [5], [4]) and in the MT community (e.g., [3], [12]). The
categories used, therefore, were designed to mitigate problems in error attribution. In
addition, the categories were derived from the data itself, particularly by looking at
errors near the word selected as the decision point word. A particular phrase can be
assigned multiple error types, particularly in the case of prepositional phrases versus
noun-noun compounding. For instance, “the night of Sunday” contains both a
prepositional error and an ordering error from an intelligibility perspective. The items identified
here are consistent with those suggested by Flanagan [3]. In fact, three of the five
Class 1 errors identified by Flanagan (capitalization, spelling, article) are here. The
other two are specific to the target languages in her study (accent and elision).
The top five errors account for 65% of the errors found, so it is possible to theorize
that a system which could detect these could accurately measure intelligibility while
being diagnostic as well. The top five error categories are prepositions; word
ordering; capitalization errors; named entity errors and incorrect, extra or missing
articles. If the named entity article category is included in articles in general, then
articles are the third highest with 12.73% and the overall percentage of the top five is
over 70%. Surprisingly, not translated words account for only six percent of the error
categories, although these were repeatedly mentioned as a giveaway clue in the post-
test interviews. Performing a linear regression analysis on the individual categories
did not demonstrate a clear indicating factor for predicting intelligibility. We further
analyzed the results into category types.
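The category roll-up just described is straightforward frequency arithmetic; the sketch below uses invented counts (the category names follow the discussion, but the numbers do not come from the study):

from collections import Counter

errors = Counter({"preposition": 100, "word ordering": 90, "capitalization": 70,
                  "named entity": 60, "article": 50, "named entity article": 30,
                  "not translated": 35, "other": 165})

def shares(counter):
    total = sum(counter.values())
    return [(cat, round(100 * n / total, 2)) for cat, n in counter.most_common()]

print(shares(errors))
merged = errors.copy()
merged["article"] += merged.pop("named entity article")  # fold NE articles into articles
print(shares(merged))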
Three types of errors fall into the surface errors category: capitalization errors; not-
translated words; and punctuation errors. Not-translated words are an obvious
indicator of machine ability. These could be accounted for explicitly. Not-translated
words accounted for 9.9% of the selected words. That is, when a word was selected as
the decision point, only in 9.9% of the cases was that word a not-translated one. On
the other hand, only 17% of the not-translated words were selected ones; therefore,
explicitly using these would be less indicative of decision factors than initially
assumed. Capitalization accounts for over 10% of the errors, another readily measured
feature, particularly for named entities and sentence beginnings.
Syntactic errors form the majority of errors in this analysis. They fall into two types
of categories. The first type has an edit-distance component, arising through the
insertion, deletion and substitution of lexical items. In this type are incorrect, extra
or missing prepositions; word ordering errors, including incorrect noun-noun
compounding and extra prepositions (e.g., school of law rather than law school);
incorrect, extra or missing articles, which do not include articles before named
entities; and incorrect, extra or missing pronouns, including indefinite pronouns and
lists of pronouns (e.g., he/she/it).
The second type of errors is more related to agreement issues such as incorrect verb
form, particularly tense; incorrect negation, particularly verbal; incorrect conjunction
or coordination term used; and incorrect agreement in gender or number.
Only two error types are semantic: named entity errors and semantic errors. Named
entity errors have been studied ([9], [10]) and a tool for measuring these exists. Major
semantic errors will be difficult to diagnose and I limited these to ones primarily of
spatial, temporal or event relationship. For instance, one system repeatedly listed a
date as “east Tuesday” as opposed to “last Tuesday”. Work in capturing and
conveying semantic information is an active research area (e.g., [2]), so it is unlikely
that an automated system could be designed to measure these at this time.
Post-test interviews have consistently shown that the deciders utilized error spotting
techniques, although the types and sensitivities of the errors differed from subject to
subject. Some errors were serious enough to make the choice obvious, while others
had to occur more than once to push the decision above a threshold. One subject
reported a tendency to want confirmation, saying that they could have guessed at the
first line, but wanted to see the second just to be sure. Others reported wanting to be
able to add a confidence measure to their judgments.
Test subjects used a variety of errors in vocabulary, grammar and style as
indicators. The consensus was that a text that was not perfect was machine-generated.
Lexical cues included “strange words”, not-translated words, and extra words
although misspellings were often forgiven. Other subjects recognized “convoluted
constructs” such as too many prepositional phrases in a row, phrases starting with
incorrect prepositions or verbal phrases in the infinitive form. Cumbersome sentence
constructions that “didn’t flow smoothly” involved choppy phrases strung together,
long sentences, word ordering and literal translations, with one participant describing
the grammar as “pathetic”.
Those participants who did not immediately attribute weak translations to MT
generally made either no assumptions of translation quality or attributed poor human
translation to some of the cases. One described assuming the translator to be a non-
native English speaker. Others supposed that, since English is hard to learn, even
human translators can make mistakes.
This work demonstrates that raters can make accurate intelligibility judgments with
very little data. The impact of this is that it is conceivable to build an automated
evaluation for intelligibility which relies on very little text, as opposed to those
metrics which rely on corpora of test data. In addition, there are specific error types
which contribute to the judgment which would enable diagnostic evaluation tools to
be constructed. While it is not news to the MT community that certain types of errors
contribute to judgments of poor intelligibility, it is unusual to have both the number
and types quantified. Having said this, work still needs to be done in several
dimensions. The first is to look at a system-by-system breakdown of the percentages
and plot this against intelligibility scores to determine the weight that each type of
error brings to the intelligibility score. Secondly, a closer look at the errors
occurring at the selected words may indicate definitive features for decision making.
Finally, this work proves out ideas of measuring “Englishness” of English-target MT.
On the other hand, it argues not for measuring the positive amount of English
conformance, but the negative degree of non-conformance to English standards. This
indicates that it may be possible to build an error detection and correction post-MT
module to improve MT output.
Acknowledgement. The views expressed in this paper are those of the author and do
not reflect the policy of the MITRE Corporation. Many thanks to the anonymous
reviewers for their astute and helpful comments.
References
[1] ALPAC. 1966. Language and machines: Computers in Translation and Linguistics. A
Report by the Automatic Language Processing Advisory Committee. Division of
Behavioral Sciences, National Academy of Sciences, National Research Council,
Washington, D.C.
[2] Farwell, D., Helmreich, S., Dorr, B., Habash, N., Reeder, F., Miller, K., Levin, L.,
Mitamura, T., Hovy, E., Rambow, O., Siddharthan, A. 2004. Interlingual Annotation of
Multilingual Text Corpora. In Proceedings of Workshop on Frontiers in Corpus
Annotation. NAACL/HLT 2004.
[3] Flanagan, M. 1994. Error Classification for MT Evaluation. In Technology Partnerships
for Crossing the Language Barrier: Proceedings of the First Conference of the
Association for Machine Translation in the Americas, Columbia, MD.
[4] Heift, G. 1998. Designed Intelligence: A Language Teacher Model. Unpublished Ph.D.
thesis. Simon Fraser University.
[5] Holland, V.M. 1994. Lessons Learned in Designing Intelligent CALL: Managing
Communication across Disciplines. Computer Assisted Language Learning, 7(3), 227-
256.
[6] Hovy E., King M. & Popescu-Belis A. 2002. Principles of Context-Based Machine
Translation Evaluation. Machine Translation, 17(1), p.43-75
[7] Meara, P. & Babi, A. 1999. Just a few words: how assessors evaluate minimal texts.
Vocabulary Acquisition Research Group Virtual Library.
www.swan.ac.uk/cals/vlibrary/ab99a.html
[8] Papineni, K., Roukos, S., Ward, T. & Zhu, W-J. 2002. Bleu: a Method for Automatic
Evaluation of Machine Translation. Proceedings of ACL-2002, Philadelphia, PA.
[9] Papineni, K., Roukos, S., Ward, T., Henderson, J., & Reeder, F. 2002. Corpus-based
comprehensive and diagnostic MT evaluation: Initial Arabic, Chinese, French, and
Spanish results. In Proceedings of Human Language Technology 2002, San Diego, CA.
[10] Reeder, F., Miller, K., Doyon, J., White, J. 2001. “The Naming of Things and the
Confusion of Tongues: An MT Metric”, MT-Summit Workshop on MT Evaluation,
September.
[11] Van Slype, G. 1979. Critical Methods for Evaluating the Quality of Machine
Translation. Prepared for the European Commission Directorate General Scientific and
Technical Information and Information Management. Report BR-19142. Bureau Marcel
van Dijk.
[12] White, J. 2000. Toward an Automated, Task-Based MT Evaluation Strategy. In
Maegaard, B., ed., Proceedings of the Workshop on Machine Translation Evaluation at
LREC-2000. Athens, Greece.
[13] White, J., et al. 1992-1994. ARPA Workshops on Machine Translation. Series of 4
workshops on comparative evaluation. PRC Inc. McLean, VA.
Interlingual Annotation for MT Development
1 Introduction
2 Translation Divergences
A distinctive aspect of this project is its focus on multiple English translations
of the same text. By comparing the annotations of the source text and its translations,
any differences indicate one of three general problems: potential inadequacies in the
interlingua, misunderstandings by the annotators, or mistranslations. By analyzing
such differences, we can sharpen the IL definition, improve the instructions to
annotators, and/or identify the kinds of translational differences that occur (and decide
what to do about them).
Consider the first two paragraphs of K1E1 and K1E2 (two English translations of
Korean text K1):
K1E1: Starting on January 1 of next year, SK Telecom subscribers can
switch to less expensive LG Telecom or KTF. ... The Subscribers cannot
switch again to another provider for the first 3 months, but they can cancel
the switch in 14 days if they are not satisfied with services like voice quality.
K1E2: Starting January 1st of next year, customers of SK Telecom can
change their service company to LG Telecom or KTF ... Once a service
company swap has been made, customers are not allowed to change
companies again within the first three months, although they can cancel the
change anytime within 14 days if problems such as poor call quality are
experienced.
First, if the interlingua term repository contains different terms for subscriber and
customer, then the single Korean source term will have given rise to different
interpretations. Here, we face a choice: we can ask annotators to explicitly search for
near-synonyms and include them all, or we can compress the term repository to
remove these near-synonyms. Second, K1E1 contains less expensive, which K1E2
omits altogether. This could be either one translator’s oversight or another’s
incorporation of world knowledge into the translation. Third, is voice quality the
same as call quality? Certainly voice is not the same as call. What should the
interlingua representation be here—should it focus merely on the poor quality,
skirting the modifiers altogether? Before we could address these more intriguing
cases, we faced the necessity of building an IL representation which was increasingly
abstract while maintaining a level of feasibility from an annotation perspective. This
led to the approach described in the next section.
Fig. 1. Example of IL0 for “Sheikh Mohamed, who is also the Defense Minister of the United
Arab Emirates, announced at the inauguration ceremony that ‘we want to make Dubai a new
trading center.’”
4 Annotation Process
For any given subcorpus, the annotation effort involves assignment of IL content to
sets of at least 3 parallel texts, 2 of which are in English, and all of which theoretically
communicate the same information. Such a multilingual parallel data set of source-
language texts and English translations offers both unique perspective and problems
for annotating texts for meaning. Since we gather our corpora from disparate sources,
we standardize a text before presenting it to automated procedures. For English, this
means sentence boundary detection, but for other languages, it involves segmentation,
chunking of text, and other operations. The text is then parsed with Connexor [18],
and the output is viewed and corrected in TrED [9]. The revised deep dependency
structure produced by this process is the IL0 representation for that sentence. To
create IL1 from IL0, annotators use Tiamat, a tool developed specifically for this
project. This tool enables viewing the IL0 tree with easy reference to current IL
representation, ontology, and theta grids.
The annotators are instructed to annotate all nouns, verbs, adjectives, and adverbs.
This involves choosing all relevant concepts from Omega – both concepts from
Wordnet SYNSETs and those from Mikrokosmos. In addition, annotators are
instructed to provide a semantic case role for each dependent of a verb. LCS verbs are
identified with Wordnet classes and the LCS case frames are supplied where possible.
The annotator, however, is often required to determine the set of roles or alter them to
suit the text. In both cases, the revised or new set of case roles are noted and sent to a
reviewer for evaluation and possible permanent inclusion.
Three manuals make up the markup instructions: a users’ guide for Tiamat, including
procedural instructions; a definitional guide to semantic roles; and a manual for
creating a dependency structure (IL0). Together these manuals allow the annotator to
understand (1) the intention behind aspects of the dependency structure; (2) how to
use Tiamat to mark up texts; and (3) how to determine appropriate semantic roles and
ontological concepts. In choosing a set of appropriate ontological concepts, annotators
are encouraged to look at the name of the concept and its definition, the name and
definition of the parent node, example sentences, lexical synonyms attached to the
same node, and sub- and super-classes of the node.
5 Evaluation Methodology
In order to evaluate the annotators’ output, an evaluation tool was also developed to
compare the output and to generate the evaluation measures. The reports generated by
the evaluation tool allow the researchers to look at both gross-level phenomena, such
as inter-annotator agreement, and at more detailed points of interest, such as lexical
items on which agreement was particularly low, possibly indicating gaps or other
inconsistencies in the ontology. Two evaluation strategies have been applied: inter-
annotator agreement and annotator reconciliation. For inter-annotator agreement, the
annotated decisions for each word and each theta role are recorded. Agreement is
measured according to the number of annotators that selected a particular role or
sense. For annotator reconciliation, each annotator reviews the selections made by
the other annotators and votes (privately) as to whether or not it is acceptable. The
annotators discuss results, followed by a second private vote. Since the reconciliation
process is on-going, we will not report these results.
In the first evaluation path, we measured inter-annotator agreement. While Kappa
[2] is a standard measure for inter-annotator agreement, measuring it for this project
presented difficulties. Calculating agreement and expected agreement when a number
of annotators can assign zero or more senses per word is not straightforward. Also,
because of multiple annotators, we calculate an average of pair-wise agreement per
word over all annotator pairs. Because multiple categories (senses) can be assigned for
each word, we are faced with two decisions: a) do we count only explicit agreement,
i.e. cases where the annotators select the same sense; or b) also implicit agreement,
i.e. cases where neither annotator selects a given sense? Also, we must account for cases when no
concept is provided in Omega. Later we can explore the option of applying weighting
to Kappa using Omega’s hierarchical structure to compute similarity amongst options.
Two approaches are described.
For a specific word and a pair of annotators who have made one or more selections
of semantic tags, agreement is measured as the ratio of the number of agreeing
selections to the number of all selections. Agreement is measured based on positive
selections only, i.e., cases where the two annotators select the same semantic tag,
as opposed to cases where neither selects it. For a word W with a set of n possible
semantic tags t_1, ..., t_n, the function f(t_i) is defined as the sum of the selections
made by the two annotators for tag t_i (0, 1, or 2), and pair-wise agreement
for a specific word is defined as:

agr(W) = |{ t_i : f(t_i) = 2 }| / |{ t_i : f(t_i) >= 1 }|
Pair-wise agreement is measured as the average of agreement over all the words in a
document. The overall inter-annotator agreement is measured as the average of pair-
wise inter-annotator agreement of every pair of annotators. To calculate Kappa, we
estimate chance agreement by a random 100-fold simulation in which both the number
of concepts selected and the particular concepts selected are randomly assigned,
restricted by the number of concepts per word in Omega. If Omega has no concepts
associated with the word, the chance agreement is computed as the inverse of the
size of all of Omega.
When using this calculation method, it is possible to have a zero denominator when
both annotators pick all of or none of the senses. These cases are counted separately
and are removed from the calculation. No weighting is used at this time.
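One possible implementation of this scheme is sketched below; it reflects our reading of the description (intersection of positive selections over their union) rather than the project's actual evaluation tool:

import random

def word_agreement(sel_a, sel_b):
    # sel_a, sel_b: sets of Omega concepts chosen by two annotators for one word.
    union = sel_a | sel_b
    if not union:
        return None                      # zero/zero case, excluded from the average
    return len(sel_a & sel_b) / len(union)

def chance_agreement(n_senses, k_a, k_b, folds=100):
    # Simulate two annotators picking k_a and k_b of n_senses concepts at random.
    vals = []
    for _ in range(folds):
        a = set(random.sample(range(n_senses), k_a))
        b = set(random.sample(range(n_senses), k_b))
        union = a | b
        vals.append(len(a & b) / len(union) if union else 1.0)
    return sum(vals) / len(vals)

def kappa(observed, expected):
    # Undefined (zero denominator) when expected agreement is 1, e.g. when both
    # annotators pick all of the senses or none of them.
    return (observed - expected) / (1 - expected)

print(word_agreement({"c1", "c2"}, {"c2", "c3"}))        # 0.33...
print(kappa(0.5, chance_agreement(10, 2, 2)))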
6 Results
The dataset has six pairs of English translations (250 words apiece) from each of the
source languages. The ten annotators were asked to annotate nouns, verbs, adjectives
and adverbs only with Omega concepts. The annotators selected one or more concepts
from both WordNet and Mikrokosmos-derived nodes. Annotated verb arguments
were assigned thematic roles. An important issue in the data set is the problem of
incomplete annotations which may stem from: (1) lack of annotator awareness of
missing annotations; (2) inability to finish annotations because of an intense schedule;
and (3) ontology omissions for words for which annotators selected DummyConcept
or no annotation at all. For 1,268 annotated words, 368 (29%) have no Omega
WordNet entry and 575 (45%) do not have Omega Mikrokosmos entries.
To address incomplete annotations, we calculate agreement in two different ways
that exclude annotations (1) by annotator and (2) by word. In the first calculation, we
exclude all annotations by an annotator if annotations were incomplete by more than
a certain threshold. Table 1 shows the average number of included annotators over all
documents (A#), the Average Pair-wise Agreement (APA) and Kappa for the theta
roles, the Mikrokosmos portion of Omega and the WordNet portion of Omega. The
table is broken down by different thresholds for exclusion.
Again, since annotators did not annotate some texts or failed to choose an Omega
entry, two types of agreement are reported. The first counts cases where all senses
are marked with zero as perfect agreement with a kappa of 1; the second excludes
such zero cases entirely (Table 2). Eliminating zero pairs does not change the
agreement significantly.
Since one goal is to generate an IL representation that is useful for MT, we plan to
measure the ability to generate accurate surface texts from the IL representation as
annotated. This work will involve obtaining EXERGE [7] and Halogen [11] and
writing a converter between the IL format and those expected by generation tools.
We can then compare the generated sentences with source texts through a variety of
standard MT metrics. This will serve to determine if the elements of the
representation language are sufficiently well-defined and if they serve as a basis for
inferring interpretations from semantic representations or (target) semantic
representations from interpretations. This approach is limited to English generation
for the first pass as these tools are more readily available.
By providing an essential, and heretofore non-existent, data set for training and
evaluating natural language processing systems, the resultant annotated multilingual
corpus of translations is expected to lead to significant research and development
opportunities for machine translation and a host of other natural language processing
technologies, including question answering (e.g., via paraphrase and entailment
relations) and information extraction. Because of the unique annotation processes in
which each stage (IL0, IL1, IL2) provides a different level of linguistic and semantic
information, a different type of natural language processing can take advantage of the
information provided at the different stages. For example, IL1 may be useful for
information extraction in question answering, whereas IL2 might be the level that is
of most benefit to machine translation. These topics exemplify the research
investigations that we can conduct in the future, based on the results of the annotation.
Acknowledgement. This work has been supported by NSF ITR Grant IIS-0326553.
References
[1] Bateman, J.A., Kasper, R.T., Moore, J.D., & Whitney, R.A. 1989. A General
Organization of Knowledge for Natural Language Processing: The Penman Upper
Model. Unpublished research report, USC/Information Sciences Institute, Marina del
Rey, CA.
[2] Carletta, J. C. 1996. Assessing agreement on classification tasks: the kappa statistic.
Computational Linguistics, 22(2), 249-254
[3] Dorr, B. 1993. Machine Translation: A View from the Lexicon, MIT Press, Cambridge,
MA.
[4] Dorr, B. 2001. LCS Verb Database, Online Software Database of Lexical Conceptual
Structures and Documentation, University of Maryland.
http://www.umiacs.umd.edu/~bonnie/LCS_Database_Documentation.html
[5] Farwell, D., Helmreich, S., Dorr, B., Habash, N., Reeder, F., Miller, K., Levin, L.,
Mitamura, T., Hovy, E., Rambow, O., Siddharthan, A. 2004. Interlingual Annotation of
Multilingual Text Corpora. In Proceedings of Workshop on Frontiers in Corpus
Annotation. NAACL/HLT 2004.
[6] Fellbaum, C. (ed.). 1998. WordNet: An On-line Lexical Database and Some of its
Applications. MIT Press, Cambridge, MA.
[7] Habash, N. 2003. Matador: A Large Scale Spanish-English GHMT System. In
Proceedings of the MT Summit, New Orleans, LA.
[8] Habash, N., B. Dorr, & D. Traum, 2002. “Efficient Language Independent Generation
from Lexical Conceptual Structures,” Machine Translation, 17:4.
[9] Hajič, J.; Vidová-Hladká, B.; Pajas, P. 2001. The Prague Dependency Treebank:
Annotation Structure and Support. In Proceedings of the IRCS Workshop on Linguistic
Databases, University of Pennsylvania, Philadelphia, USA, pp. 105-114.
[10] Hirst, G. 2003. Paraphrasing paraphrased. Invited talk at Second International Workshop
on Paraphrasing, 41st Annual Meeting of the ACL, Sapporo, Japan.
[11] Knight, K. and Langkilde, I. 2000. Preserving Ambiguities in Generation via Automata
Intersection. American Association for Artificial Intelligence conference (AAAI’00).
[12] Knight, K., & Luk, S.K. 1994. Building a Large-Scale Knowledge Base for Machine
Translation. Proceedings of AAAI. Seattle, WA.
[13] Kozlowski, R., McCoy, K., Vijay-Shanker, K. 2003. Generation of Single-Sentence
Paraphrases from Predicate/argument Structure using Lexico-grammatical Resources.
Second International Workshop on Paraphrasing, 41st ACL, Sapporo, Japan.
[14] Mahesh, K., & Nirenburg, S. 1995. A Situated Ontology for Practical NLP. Proc. of
Workshop on Basic Ontological Issues in Knowledge Sharing at IJCAI-95. Montreal,
Canada.
[15] Mitamura, T., E. Nyberg, J. Carbonell. 1991. An Efficient Interlingua Translation
System for Multilingual Document Production. Proc. of MT Summit. Washington,
DC.
[16] Philpot, A., M. Fleischman, E.H. Hovy. 2003. Semi-Automatic Construction of a General
Purpose Ontology. Proc. of the International Lisp Conference. New York, NY. Invited.
[17] Rinaldi, F., Dowdall, J., Kaljurand, K., Hess, M., Molla, D. 2003. Exploiting Paraphrases
in a Question Answering System. International Workshop on Paraphrasing, 41st ACL.
[18] Tapanainen, P. & T Jarvinen. 1997. A non-projective dependency parser. In the
Conference on Applied Natural Language Processing, Washington, DC.
Machine Translation of Online Product Support Articles
Using a Data-Driven MT System
Stephen D. Richardson
1 Introduction
The NLP group at Microsoft Research has created and begun internal deployment
throughout Microsoft of MSR-MT [1], a data-driven machine translation (DDMT)
system that has been trained on over a million translated sentences taken from prod-
uct documentation and support materials in English and each of four languages:
French, German, Japanese, and Spanish. MSR-MT has been used to translate Micro-
soft’s Product Support Services (PSS) knowledge base into each of these languages.
A multitude of additional opportunities to use MSR-MT exists at Microsoft, including
in product localization and in many other groups like PSS, where translation of large
amounts of material has not yet been considered because of cost and time constraints.
Microsoft stands to save or otherwise realize the value of tens of millions of dollars in
translation services annually using MSR-MT.
With an annual translation budget of hundreds of millions of dollars, Microsoft is
still unable to translate massive amounts of documentation and other materials. The
public PSS knowledge base, for example, contains over 140K articles and 80M words
of text. Because of translation costs, generally only a few thousand articles have been
translated into each of the major European and Asian languages annually, providing
only a sampling of online support to a growing international customer base. Mean-
while, hundreds more articles are added and/or updated on a weekly basis. Increasingly
costly phone support has been the only solution in the past to this chronic problem for
PSS. Groups responsible for the content available on the Microsoft Developer Net-
work (MSDN) and Microsoft’s Technet are facing similar challenges. There are yet
other groups at Microsoft whose budgets have not yet begun to allow them to think
about translating their materials generally, especially those customized for and tar-
geted to specific international customers.
Using translation memory (TM) tools such as TRADOS, the Microsoft localization
community has been able to realize substantial savings in translating product docu-
mentation, which is often highly repetitive, “recycling” anywhere from 20% to 80%
of translated sentences. But with a company-wide average recycling rate of around
40%, there is still a greater portion of text that must be translated from scratch, thus
incurring costs averaging from 20 to 50 cents per word, depending on the language.
Text volumes, together with the translation budget, continue to increase.
Facing escalating translation and phone support costs, PSS approached an MT vendor
a few years ago about the possibility of using their commercial system. The vendor
proposed a pilot to show how their system could be (manually) customized to pro-
duce better quality machine translations. For English to Spanish, $50K was requested
to cover the pilot customization period of a few months, with the understanding that
this would lead to a full-fledged customization and ongoing maintenance agreement.
The initial and projected costs were a formidable barrier to acceptance by PSS of this
customized MT system.
PSS then turned to the Microsoft Research’s NLP group for help. An agreement
was reached through which PSS supported the finishing touches on MSR-MT for an
English-to-Spanish pilot.
After a period of further development, MSR-MT was trained overnight on a few
hundred thousand sentences culled from Microsoft product documentation and sup-
port articles, together with their corresponding translations (produced by human lo-
calizers using the TRADOS translation memory tool). As reported at AMTA 2002
[2], the system was deployed and over 125,000 articles in the knowledge base (KB)
were automatically translated into Spanish, indexed, and posted to a pilot web-site. A
few months later, customer satisfaction with the articles, as measured by surveying a
small sample of the approximately 60,000 visits to the web site, averaged 86% -- 12
points higher than for the English KB!
It appears that the Spanish users were so happy to have all the articles in their own
language that they were willing to overlook the fact that their quality was less than
that of human translations. Nevertheless, the “usefulness” rate (i.e., the percentage of
customers feeling that an article helped solve their problem) for the machine trans-
lated articles was about 50%, compared to 51% for human translated Spanish articles
and just under 54% for English articles. PSS management was excited to see that the
potential of MSR-MT to lower support line call volume could be nearly the same as
for human-translated articles.
Based on the results of the pilot experiment, PSS decided on a permanent deploy-
ment of MSR-MT for Spanish. In April 2003, articles translated by MSR-MT, inter-
spersed with (many fewer) human translated ones, went live for Spanish-speaking
countries at http://support.microsoft.com. One may access the Spanish articles by
visiting the web site, clicking on “International Support,” and choosing “Spain” as the
country. Spanish queries may then be entered for the KB and pointers to both human
and machine translated articles will be listed, the latter being indicated by the presence
of an icon next to the title containing two small gears.
For the five month period from September 2003 through January 2004, the perma-
nent deployment of the Spanish KB achieved a 79% customer satisfaction rate (com-
pared to 86% during the pilot and 73% for the original US English KB) and solid
55% usefulness rate (compared to 50% during the pilot and 57% for the US English).
While the satisfaction rate has levelled off a bit as users have apparently become
accustomed to the availability of KB articles in their language, it is still higher than
the original English. Thus more continues to be better in spite of imperfect transla-
tions, with 20 times more articles in Spanish than before MT output was available.
We attribute the rise in the usefulness rate in part to the fact that the coverage and
accuracy of MSR-MT were significantly enhanced after the pilot and before the per-
manent deployment by increasing the set of bilingual sentence pairs used to train the
system from 350K to 1.6M. This was achieved by gathering data from additional
translation memories for many more products and newer versions of products. We
deemed this especially important after the pilot as we observed a number of sparse
data deficiencies due to the vast variety of products discussed in the KB articles. The
result was a 10% jump in BLEU score (from .4406+/-.0162 to .4819+/-.0177) on a
test set of PSS article sentences for which we had human translations.
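For context only (this is not MSR-MT's evaluation code, and the sentences are invented), corpus-level BLEU against reference translations can be computed along these lines:

from nltk.translate.bleu_score import corpus_bleu

# One list of tokenized reference translations per hypothesis sentence.
references = [[["click", "the", "start", "button", "."]],
              [["restart", "the", "computer", "."]]]
hypotheses = [["click", "the", "start", "button", "."],
              ["reboot", "the", "computer", "."]]
print(corpus_bleu(references, hypotheses))

old_bleu, new_bleu = 0.4406, 0.4819   # the scores reported above
print(f"relative gain: {(new_bleu - old_bleu) / old_bleu:.1%}")   # about 9.4%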
With the success of the Spanish KB, our next (and more ambitious) target was a
Japanese version. After training MSR-MT with over 1.2M sentence pairs, the Japa-
nese pilot KB (with 140K+ articles) was deployed during the last two months of
2003. For a language that is admittedly tougher to translate and a user community that
has a reputation for being hard to please, the overall satisfaction rate for the modest
pilot was a surprising 71% and the usefulness rate was 56%—both very comparable
to the original US English rates. Table 1 compares the customer survey results for the
Japanese and Spanish pilots together with the permanent deployment survey results
for Spanish and US English. The success of this pilot led to a permanent deployment
of the Japanese KB in March 2004, containing both human and machine translated
articles in like fashion to the Spanish KB. With careful scrutiny of and feedback on
the Japanese KB by internal Microsoft users as well as external users, an updated
version of the KB was posted online in June 2004 and is enjoying a very positive
reception (to be reported when this paper is presented). A screen shot from one of the
articles in the Japanese KB is displayed in Figure 1.
In the first quarter of 2004, pilots were begun of both French and German versions
of the KB, translated by MSR-MT. It is anticipated that permanent deployments for
these languages will be made available later this year. Work is also ongoing to create
versions of MSR-MT capable of translating from English into Italian, Chinese, and
Korean, as well as into other languages important to Microsoft’s international busi-
ness.
To address the need to reduce increasing localization costs where polished transla-
tions are required, we have integrated MSR-MT into the TRADOS translator’s work-
bench. In the absence of an exact recycled alternative, we provide a machine-
translated suggestion in the translation memory (TM) that the human translator can
choose and edit if desired. This results in a measurable increase in translation
throughput. In a recent experiment conducted in a tightly controlled usability lab
setting, 3 translators translated 16 different documents with and without MT output in
the TM, and were shown with statistical significance to be 35% faster with the MT
output than without it. Details of this experiment will be reported separately in the
future.
In the process of experimenting with MT post-editing, we have confirmed what
others have already observed: that consideration of human factors is crucial, and that
training is required to maximize post-editing efficiency. A number of MT post-editing
pilots are in progress or planned for this year, involving the four languages currently
supplied by MSR-MT. In facilitating localization, as in the publication of raw MT for
certain applications, Microsoft stands to realize savings of millions of dollars.
Another area for potential cost savings using MSR-MT is in dealing with the prod-
uct feedback recorded by a group within PSS that analyzes customer concerns, new
feature requests, and customer task scenarios as they are reported during customer
support phone calls. Previously, only feedback coming from English-speaking cus-
tomers was analyzed and channelled back to the product groups, as there were no
means nor translation budget to handle the growing volume of cases (now about 50%
of all cases worldwide) from non-English-speaking users. Efforts are underway to
make use of MSR-MT, which is currently trained to translate both to and from Eng-
lish and the four languages mentioned above, to enable the translation of customer
cases into English, and the subsequent analysis of this data for the improvement of
Microsoft’s products.
Finally, we have provided limited availability of MSR-MT as a web service on
Microsoft’s internal corporate network to users of Word 2003 (which includes just
about everyone) through the translation function located in the Task Pane. By default
this function provides access to third-party MT providers via the Internet. Currently, the
same version of MSR-MT, trained to translate Microsoft technical texts (such as PSS
articles) to and from English and the four languages previously mentioned (and also
including a Chinese to English pair), is available either on a server or as a download-
able service to run on the client’s machine. Deployment of MSR-MT in this context
enables a variety of other uses, and provides a means for groups to explore other
applications of MT in their own areas of responsibility.
Maintenance Issues for Machine Translation Systems
Nestor Rychtyckyj
1 Introduction
We have been utilizing a Machine Translation system at Ford Motor Company since
1998 within Vehicle Operations for the translation of our process assembly build
instructions from English to German, Spanish, Portuguese and Dutch. This system
was developed in conjunction with SYSTRAN Software Inc. and is an integral part of
our worldwide process planning system for manufacturing assembly. The input to our
system is a set of process build instructions that are written using a controlled lan-
guage known as Standard Language. The process sheets are read by an artificial in-
telligence (AI) system that parses the instructions and creates detailed work tasks for
each step of the assembly process. These work tasks are then released to the assembly
plants where specific workers are allocated for each task. In order to support the as-
sembly of vehicles at plants where the workers do not speak English, we utilize MT
technology to translate these instructions into the native language of these workers.
Standard Language is a restricted subset of English and contains a limited vocabulary
of about 5000 words that also include acronyms, abbreviations, proper nouns and
other Ford-specific terminology. In addition, Standard Language allows the process
sheet writers to embed comments within Standard Language sentences. These com-
ments are ignored by the AI system during its processing, but have to be translated by
the MT system. Standard Language also utilizes some structures that are grammati-
cally incorrect and create problems during the MT process. Therefore, the develop-
ment of a translation system for these requirements entailed considerable customiza-
tion to the SYSTRAN translation engines as well as a lot of effort in building the
technical glossaries to enable correct translation of Ford-specific manufacturing ter-
minology. We described the process of building this system in a previous paper [1];
our focus of this paper is to discuss the process and methodology of MT system
maintenance.
Section 2 will provide an overview of the existing manufacturing process at Ford
and describe how MT is part of that process. In Section 3 we will discuss Standard
Language in more detail and illustrate some of the issues that need to be addressed
during the translation process. The next section will discuss the process of MT main-
tenance and show what needs to be done to keep our technical glossaries and transla-
tion software up to date. We will also describe some of the new tools that have been
developed by SYSTRAN in order to facilitate this process. Our paper concludes with
a discussion of future work and our view of the role that MT will continue to play in
Ford’s manufacturing processes.
The machine translation system utilized at Ford is integrated into the Global Study
Process Allocation System (GSPAS). The goal of GSPAS is to incorporate a stan-
dardized methodology and a set of common business practices for the design and
assembly of vehicles to be used by all assembly plants throughout the world. GSPAS
allows for the integration of parts, tools, process descriptions and all other informa-
tion required to build a motor vehicle into one system and provides the engineering
and manufacturing communities a common platform and toolset for manufacturing
process planning. GSPAS utilizes Standard Language as a requirement for writing
process build instructions and we have deployed an MT solution for the translation of
these process build instructions.
The translation process at Ford for our manufacturing build instructions is fully
automated and does not require manual intervention. All of the process build instruc-
tions are stored within an Oracle database; they are written in English and validated
by the AI system. AI validation consists of parsing the Standard Language sentence,
analyzing it and matching the description to the appropriate work description in the
knowledge base and creating an output set of work instructions, their associated
MODAPTS codes and time required to perform each operation. MODAPTS codes
(Modular Arrangement of Predetermined Time Standards) are used to calculate the
time required to perform these actions. MODAPTS is an industrial measurement
system that is used around the world [2]. A more complete description of the GSPAS AI system
can be found in [3].
After a process sheet is validated and the AI system generates the appropriate
MODAPTS codes and times, a process engineer will release the process sheet to the
appropriate assembly plants. A vehicle that is built at multiple plants needs to have
these process sheets sent to each of these assembly plants. The information about
each local plant is stored in the database and those plants that require translation are
picked out by the system. The system then selects the process sheets that require
translation and starts the daily translation process for each language. Currently we
translate the process build instructions for 24 different vehicles into the appropriate
language. English-Spanish is the most commonly used language pair as it supports
our assembly plants in Spain, Mexico and South America.
The machine translation system was implemented into GSPAS through the devel-
opment of an interface into the Oracle database. Our translation programs extract the
data from an Oracle database, utilize the SYSTRAN system to complete the actual
translation, and then write the data back out to the Oracle database.
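A minimal sketch of this batch flow is given below. It assumes a hypothetical relational schema (process_sheets and translations tables) and a placeholder systran_translate() call; it is not Ford's actual database design or the SYSTRAN API.

# Sketch of the daily batch flow: select released process sheets that still
# need translation for each target language, translate them, and write the
# results back. Table and column names are hypothetical stand-ins.
import sqlite3

TARGET_LANGUAGES = ["de", "es", "pt", "nl"]

def systran_translate(text, lang):
    return "[" + lang + "] " + text          # placeholder for the real engine

def run_daily_translation(conn):
    cur = conn.cursor()
    for lang in TARGET_LANGUAGES:
        cur.execute(
            "SELECT sheet_id, english_text FROM process_sheets "
            "WHERE released = 1 AND sheet_id NOT IN "
            "(SELECT sheet_id FROM translations WHERE lang = ?)", (lang,))
        for sheet_id, text in cur.fetchall():
            cur.execute(
                "INSERT INTO translations (sheet_id, lang, text) VALUES (?, ?, ?)",
                (sheet_id, lang, systran_translate(text, lang)))
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE process_sheets (sheet_id INTEGER, english_text TEXT, released INTEGER);"
        "CREATE TABLE translations (sheet_id INTEGER, lang TEXT, text TEXT);"
        "INSERT INTO process_sheets VALUES (1, 'OBTAIN WIRE ASSEMBLY', 1);")
    run_daily_translation(conn)
    print(conn.execute("SELECT * FROM translations").fetchall())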
Our user community is located globally. The translated text is displayed on the
user’s PC or workstation using a graphical user interface through the GSPAS system.
It runs on Hewlett Packard workstations under the HP UNIX operating system. The
Ford multi-targeted customized dictionary that contains Ford technical terminology
was developed in conjunction with SYSTRAN and Ford based on input from engi-
neers and linguists familiar with Ford’s terminology.
One of the most difficult issues in deploying any translation system is the need to
obtain consistent and accurate evaluation of the quality of the translations (both
manual and machine). We are using the J2450 metric developed by the Society of Automotive
Engineers (SAE) as a guide for our translation evaluators [4]. The J2450 metric was
developed by an SAE committee consisting of representatives from the automobile
industry and the translation community as a standard measurement that can be applied
to grade the translation quality of automotive service information. This metric provides
guidelines for evaluators to follow; it defines a set of error categories and weights for
the errors found, and it calculates a score for a given document. The metric
does not attempt to grade style, but focuses primarily on the understandability of the
translated text. The utilization of the SAE J2450 metric has given us a consistent and
tangible method to evaluate translation quality and identify which areas require the
most improvement.
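The following sketch illustrates the general shape of such a metric: each error found by an evaluator carries a category and severity, each combination has a weight, and the document score is the weighted error total normalized by the length of the source text. The categories and weights shown are illustrative placeholders, not the official SAE J2450 values.

# Illustrative J2450-style scoring: weighted error count per source word.
EXAMPLE_WEIGHTS = {                 # (category, severity) -> weight (made up)
    ("wrong_term", "serious"): 5, ("wrong_term", "minor"): 2,
    ("omission", "serious"): 4,   ("omission", "minor"): 2,
    ("misspelling", "serious"): 3, ("misspelling", "minor"): 1,
}

def weighted_error_score(errors, source_word_count):
    """errors: list of (category, severity) tuples found in one document."""
    total = sum(EXAMPLE_WEIGHTS[e] for e in errors)
    return total / source_word_count          # lower scores are better

if __name__ == "__main__":
    errors = [("wrong_term", "serious"), ("misspelling", "minor")]
    print(round(weighted_error_score(errors, source_word_count=120), 3))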
We have also spent substantial effort in analyzing the source text in order to iden-
tify which terms are used most often in Standard Language so that we can concentrate
our resources on those most common terms. This was accomplished by using the
parser from our AI system to store parsed sentences into the database. Periodically,
we run an analysis of our parsed sentences and create a table in which our terminology
is listed in order of frequency of use. This table is then compared to the technical glossary to
ensure that the most-commonly used terms are being translated correctly. The fre-
quency analysis also allows us to calculate the number of terms that need to be trans-
lated correctly to meet a certain translation accuracy threshold.
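A simple sketch of this analysis is shown below: term frequencies are counted from the parsed sentences, the ranked list is checked against the glossary, and the number of top-ranked terms needed to reach a given coverage of all term occurrences is estimated. The data and the coverage target are illustrative.

# Sketch of the frequency analysis: rank terms, find glossary gaps, and
# estimate how many top terms are needed to reach a coverage target.
from collections import Counter

def coverage_report(parsed_terms, glossary, target=0.95):
    freq = Counter(parsed_terms)
    ranked = freq.most_common()
    missing = [term for term, _ in ranked if term not in glossary]
    total = sum(freq.values())
    covered, needed = 0, 0
    for term, count in ranked:                 # most frequent terms first
        covered += count
        needed += 1
        if covered / total >= target:
            break
    return ranked, missing, needed

if __name__ == "__main__":
    terms = ["bolt", "bolt", "liftgate", "wire", "bolt", "moulding"]
    glossary = {"bolt", "wire"}
    ranked, missing, needed = coverage_report(terms, glossary, target=0.8)
    print("ranked:", ranked)
    print("not in glossary:", missing)
    print("terms needed for 80% coverage:", needed)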
3 Standard Language
Standard Language is a controlled language that provides for the expression of im-
perative English assembly instructions at any level of detail. All of the terms in Stan-
dard Language with their pertinent attributes are stored in the GSPAS knowledge
base in the form of a semantic network-based taxonomy, as shown in Figure 1. Certain
word categories in the language possess specific semantics as defined by the engi-
neering community. Verbs in the language are associated with specific assembly
instructions and are modified by significant adverbs where appropriate. For example,
the phrases inspect, visually inspect and manually verify all have different interpreta-
tions. Information on tools and parts that are associated with each process sheet is
used to provide extra detail and context.
The Standard Language sentence is written in the imperative form and must con-
tain a verb phrase and a noun phrase that is used as the object of the verb. Any addi-
tional terms that increase the level of detail, such as adverbs, adjuncts and preposi-
tional phrases are optional and may be included at the process writer’s discretion.
The primary driver of any sentence is the verb that describes the action that must be
performed for this instruction. The number of Standard Language verbs is limited and
each verb is defined to describe a single particular action. For example, the verbs
position and seat have different meanings and cannot be used interchangeably. The
object of the verb phrase is usually a noun phrase that describes a particular part of
the vehicle, tool or fastener. Standard Language allows the usage of modifiers that
provide additional detail for those objects. The process sheet writer may use preposi-
tional phrases to add more detail to any sentence. Certain prepositions have specific
meaning in Standard Language and will be interpreted in a predetermined manner
when encountered in a sentence. For example, the preposition using will always sig-
nify that a tool description will follow. Figure 2 shows how the Standard Language
sentence “Feed 2 150 mm wire assemblies through hole in liftgate panel” is parsed
into its constituent cases.
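As a toy illustration of this kind of case parsing (and not the actual GSPAS AI parser), the sketch below splits an imperative instruction into a verb, an object noun phrase, and prepositional cases, using the convention that certain prepositions such as "using" always introduce a specific case. The case labels are assumptions for the example.

# Toy case parser for imperative controlled-language instructions.
PREPOSITION_CASES = {"using": "tool", "through": "path", "in": "location",
                     "to": "target", "with": "accompaniment"}

def parse_instruction(sentence):
    words = sentence.lower().split()
    parsed = {"verb": words[0], "object": [], "cases": {}}
    bucket = parsed["object"]
    for word in words[1:]:
        if word in PREPOSITION_CASES:
            # A case-marking preposition starts a new constituent.
            bucket = parsed["cases"].setdefault(PREPOSITION_CASES[word], [])
        else:
            bucket.append(word)
    parsed["object"] = " ".join(parsed["object"])
    parsed["cases"] = {k: " ".join(v) for k, v in parsed["cases"].items()}
    return parsed

if __name__ == "__main__":
    print(parse_instruction(
        "Feed 2 150 mm wire assemblies through hole in liftgate panel"))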
4 System Maintenance
As previously discussed, we have spent considerable time and effort to create a set of
customized technical glossaries that are used during the translation process. These
glossaries were developed in conjunction with SYSTRAN and with subject matter
experts from Ford Motor Company. However, since Standard Language and the Ford-
specific terminology continue to evolve, these glossaries require ongoing maintenance.
Another important facet in dictionary maintenance deals with the analysis and
customization of the source text. We have previously described some of the tech-
niques we have been using to clean up the source text to improve translation quality.
In this section we will discuss additional capabilities we have added into the system to
improve the translation of the free-form text. A Standard Language element may
contain embedded free-form text that is ignored by the AI system; however this text
must be translated and sent to the assembly plants. This free-form text usually con-
sists of additional information that may be useful to the operator on the assembly line.
Below is an example of Standard Language with embedded free-form text comments.
The text inside the curly brackets {TAPE SIDE UP} is not really part of the sen-
tence; it actually describes the position of the “mouldings”. Therefore, a translation
system that processes this sentence as one entity would not generate an accurate
translation. We need to be able to tell the system that the clause inside the curly
brackets should be treated independently of the rest of the sentence. This problem is
solved by embedding tags into the source text before it is translated. These tags
identify comments and provide the translation program with information about how
these comments should be translated. Short comments are processed dif-
ferently from long comments within Standard Language regarding translation pa-
rameters (dictionaries and segmentation).
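The sketch below illustrates this pre-processing step: comments in curly brackets are wrapped in tags so that the translation engine can treat them independently of the surrounding sentence, with short and long comments marked differently. The tag names and the length threshold are illustrative assumptions rather than the actual GSPAS implementation.

# Sketch: wrap free-form {comments} in tags before translation.
import re

SHORT_COMMENT_LIMIT = 3   # words; short and long comments are handled differently

def tag_comments(sentence):
    def wrap(match):
        comment = match.group(1).strip()
        kind = "short" if len(comment.split()) <= SHORT_COMMENT_LIMIT else "long"
        return '<comment type="' + kind + '">' + comment + '</comment>'
    return re.sub(r"\{([^}]*)\}", wrap, sentence)

if __name__ == "__main__":
    print(tag_comments("OBTAIN 2 MOULDINGS {TAPE SIDE UP} FROM STOCK"))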
Another facet of system maintenance deals with the underlying software architec-
ture that supports our translation system. Translation in GSPAS involves a set of
programs that communicate with a database as well as with the translation engines
and technical glossaries. Most changes to the translation engine processing also re-
quire changes to the translation pre-processing programs. In addition, modifications
to the database model or upgrades to the operating system require extensive testing
and validation of the translation results.
In this paper we discussed some of the issues related to the maintenance of a machine
translation application at Ford Motor Company. This application has been in place
since 1998 and we have translated more than 2 million records describing build in-
structions for vehicle assembly at our plants in Europe, Mexico and South America.
The source text for our translation consists of a controlled language, known as Stan-
dard Language, but we also need to translate free-form text comments that are em-
bedded within the assembly instructions. The most difficult issue in the development
of this system was the construction of technical glossaries that describe the manufac-
turing and engineering terminology in use at Ford. Our application uses a customized
version of the SYSTRAN Software translation system coupled with a set of Ford-
specific dictionaries that are used during the translation process. The automotive
industry is very dynamic and we need to be able to keep our technical glossaries cur-
rent and to develop a process for updating our system in a timely fashion.
The solution to our maintenance issues was the development and deployment of
the SYSTRAN Review Manager. This web-based tool gives our users the capability
to test and update the technical glossaries as needed. This has reduced our turnaround
time for deploying changes to the dictionaries from 2 months to less than 48 hours.
The SYSTRAN Review Manager runs on an internal Ford server and is available for
use by our internal customers.
System maintenance is an on-going issue. We still require additional capabilities
to improve our translation accuracy and to expand our system to other types of source
data, including part and tool descriptions. We have already introduced XML tagging
into our free-form comment translation and are working with SYSTRAN to enhance
that capability and improve translation accuracy. Our current AI system in GSPAS
already parses Standard Language into its components and we would like to pass that
information over to the translation system to improve the sentence understanding that
should lead to higher accuracy.
Our experience with Machine Translation technology at Ford has been positive; we
have shown that customization of a translation system can lead to very good results.
It is also essential to put a process in place that allows for the timely testing and up-
grades to the technical glossaries. We are confident that further enhancements to the
technology, such as tagging of terminology, will lead to better results in the future
and improve the use and acceptance of machine translation in the corporate world.
References
1. Rychtyckyj, N.: An Assessment of Machine Translation for Vehicle Assembly Process
Planning at Ford Motor Company. In: Machine Translation: From Research to Real Users,
Proceedings of the AMTA 2002 Conference, Tiburon, CA, USA, October 2002. Springer-
Verlag (2002) 207-215
2. International MODAPTS Association (IMA): MODAPTS: Modular Arrangement of
Predetermined Time Standards -- The Language of Work (2000)
3. Rychtyckyj, N.: DLMS: Ten Years of AI for Vehicle Assembly Process Planning. In:
AAAI-99/IAAI-99 Proceedings, Orlando, FL. AAAI Press (1999) 821-828
4. Society of Automotive Engineers: J2450 Quality Metric for Language Translation,
www.sae.org (2002)
Improving Domain-Specific Word Alignment with a
General Bilingual Corpus
H. Wu and H. Wang
1 Introduction
The words in a domain-specific corpus can be divided into two kinds. Some are
general words, which are also frequently used in the general domain. Others are
domain-specific words, which only occur in the specific domain. In general, it is not
hard to obtain a large-scale general bilingual corpus, while the available domain-
specific bilingual corpus is usually quite small. Thus, we
use the bilingual corpus in the general domain to improve word alignments for
general words and the bilingual corpus in the specific domain for domain-specific
words. In other words, we will adapt the word alignment information in the general
domain to the specific domain.
Although adaptation technology is widely used for other tasks such as
language modeling, little previous work, to the best of our knowledge, directly
addresses word alignment adaptation. The work most closely related to ours is the statistical
translation adaptation described in [7]. Langlais used terminological lexicons to
improve the performance of a statistical translation engine, which is trained on a
general bilingual corpus and used to translate a manual for military snipers. The
experimental results showed that this adaptation method could reduce word error rate
on the translation task.
In this paper, we perform word alignment adaptation from the general domain to a
specific domain (in this study, a user manual for a medical system) with four steps.
(1) We train a word alignment model using a bilingual corpus in the general domain;
(2) We train another word alignment model using a small-scale bilingual corpus in the
specific domain; (3) We build two translation dictionaries according to the alignment
results in (1) and (2) respectively; (4) For each sentence pair in the specific domain,
we use the two models to get different word alignment results and improve the results
according to the translation dictionaries. Experimental results show that our approach
improves domain-specific word alignment in terms of both precision and recall,
achieving a 21.96% relative error rate reduction.
The remainder of the paper is organized as follows. Section 2 introduces the
statistical word alignment method and analyzes the problems existing in this method
for the domain-specific task. Section 3 describes our word alignment adaptation
algorithm. Section 4 describes the evaluation results. The last section concludes our
approach and presents the future work.
For the specific domain, we use an English-Chinese bilingual corpus (a user manual
for a medical system), which includes 546 bilingual sentence pairs. From this
domain-specific corpus, we randomly select 180 pairs as testing data. The remaining
366 pairs are used as domain-specific training data.1
1 Generally, a user manual only includes several hundred sentences.
The Chinese sentences in both the training set and the testing set are automatically
segmented into words. Thus, there are two kinds of errors for word alignment: one is
the word segmentation error and the other is the alignment error. In Chinese, if a word
is incorrectly segmented, the alignment result is also incorrect. For example, for the
Chinese sentence meaning “Warning label for the couch-top”, our system produces a
segmentation in which one character sequence is incorrectly segmented as
“couch/taxi” instead of “couch-top/of”. Thus, segmentation errors in Chinese may
change the word meaning, which in turn causes alignment errors.
In order to exclude the effect of the segmentation errors on our alignment results,
we correct the segmentation errors in our testing set. The alignments in the testing set
are manually annotated, which includes 1,478 alignment links.
With the above metrics, we evaluate the three methods on the testing set with
Chinese as the source language and English as the target language. The results are
shown in Table 1. It can be seen that although the method “G+S” achieves the best
results among the three methods, it performs only a little better than the method “G”.
This indicates that adding the small-scale domain-specific training sentence pairs to
the general corpus does not greatly improve the alignment performance.
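The sketch below shows a common way to compute alignment precision, recall, and an error rate defined as one minus the F-measure over the produced and hand-annotated link sets; it illustrates standard practice and is not necessarily identical to the formulas used in this evaluation. The example link data are made up.

# Illustrative alignment evaluation: precision, recall, and an error rate.
def evaluate_alignment(produced, reference):
    produced, reference = set(produced), set(reference)
    correct = produced & reference
    precision = len(correct) / len(produced) if produced else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, 1.0 - f1        # last value: error rate

if __name__ == "__main__":
    produced = {("multislice", 3), ("refer to", 5)}
    reference = {("multislice", 4), ("refer to", 5)}
    print(evaluate_alignment(produced, reference))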
We use A, B and C to represent the sets of correct alignment links extracted by the
method “G+S”, the method “G” and the method “S”, respectively. From the
experiments, we compare these sets and their intersections. About 14% of the
alignment links in C are not covered by B. That is to say, although the size of the
domain-specific corpus is very small, it can produce word alignment links that are
not covered by the general corpus. These alignment links usually include domain-
specific words. Moreover, about 13% of the alignment links in C are not covered
by A. This indicates that, even by combining the two corpora, the method “G+S”
still cannot detect these domain-specific alignment links. At the same time, about
49% of the alignment links in both A and B are not covered by the set C.
For example, in the sentence pair in Figure 1, there is a domain-specific word
“multislice”. For this word, both the method “G+S” and the method “G” produce a
wrong alignment link, while the method “S” produces the correct word alignment
link. However, the general word alignment link for “refer to” is detected by both the
method “G+S” and the method “G”, but not by the method “S”.
Based on the above analysis, it can be seen that it is not effective to directly
combine the bilingual corpus in the general domain and in the specific domain as
training data. However, the correct alignment links extracted by the method “G” and
those extracted by the method “S” are complementary to each other. Thus, we can
develop a method to improve the domain-specific word alignment based on the results
of both the method “G” and the method “S”.
Another kind of error concerns multi-word alignment links.2 The IBM
statistical word alignment model only allows one-to-one or more-to-one alignment
links. However, the domain-specific terms are usually aligned to more than one
Chinese word. Thus, the multi-word unit in the corpus cannot be correctly aligned
using this statistical model. For this case, we will use translation dictionaries as guides
to modify some alignment links and get multi-word alignments.
In the alignment results, a_x represents the index position(s) of the source word(s)
aligned to the target word in position x. For example, if a Chinese word in position j
is connected to an English word in position i, then a_j = {i}. If a Chinese word in
position j is connected to English words in positions i_1 and i_2, then a_j = {i_1, i_2}.
Based on the two directional alignment sets, we obtain their intersection set, union
set3 and subtraction set. The intersection contains the links found in both directional
sets, the union contains the links found in either set, and the subtraction set contains
the links of the union that are not in the intersection. Thus, the subtraction set
contains two different alignment links for each English word.
2 Multi-word alignment links means one or more source words aligned to more than one
target word or vice versa.
3 In this paper, the union operation does not remove the replicated elements. For example, if
set one includes two elements {1,2} and set two includes two elements {1,3}, then the union
of these two sets becomes {1,1,2,3}.
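The following sketch illustrates these set operations on two directional alignments, with each link represented as a (source position, target position) pair. For brevity it uses plain sets, so its union removes duplicate links, unlike the union described in footnote 3; the notation and example data are illustrative rather than the paper's own.

# Sketch: combine the two directional alignment sets.
def combine_directions(links_src_to_tgt, links_tgt_to_src):
    s2t, t2s = set(links_src_to_tgt), set(links_tgt_to_src)
    intersection = s2t & t2s              # links found in both directions
    union = s2t | t2s                     # links found in either direction
    subtraction = union - intersection    # disagreeing candidate links
    return intersection, union, subtraction

if __name__ == "__main__":
    chinese_to_english = {(1, 1), (2, 3), (3, 4)}
    english_to_chinese = {(1, 1), (2, 2), (3, 4)}
    for name, links in zip(("intersection", "union", "subtraction"),
                           combine_directions(chinese_to_english, english_to_chinese)):
        print(name, sorted(links))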
For the specific domain, we use the corresponding word alignment sets in the two
directions, and the symbols SF, PF and MF represent the intersection set, the union
set and the subtraction set, respectively.
In the translation dictionary built from the general-domain corpus, multi-words
account for 32.89% of the total entries. In the translation dictionary built from the
domain-specific corpus, the number of multi-words is small because the training
data are very limited.
4 The thresholds are obtained to ensure the best compromise of alignment precision and recall
on the testing set.
With the statistical word alignment models and the translation dictionaries trained on
the corpora in the general domain and the specific domain, we describe the algorithm
to improve the domain-specific word alignment in this section.
Based on the bi-directional word alignments in the two domains, we define two
sets SI and UG. The word alignment links in the set SI are very reliable. Thus, we
directly accept them as correct links and add them into the final alignment set WA.
In the set UG, there are two to four different alignment links for each word. We first
examine the two translation dictionaries to see whether at least one alignment link of
this word is included in them. If so, we add the link with the largest probability or
the largest log-likelihood ratio score to the final set WA. Otherwise, we use two
heuristic rules to select alignment links. The detailed algorithm is described in
Figure 2.
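A simplified sketch of this selection step is given below: links in SI are accepted directly, and for each word with competing candidates in UG the dictionary-backed candidate with the highest score is chosen, while the heuristic rules of Figure 2 are left out. The dictionary format, scores, and example data are illustrative assumptions, not the paper's exact resources.

# Sketch of the link-selection step combining SI, UG, and two dictionaries.
def select_links(si_links, ug_candidates, dict_general, dict_specific):
    final = set(si_links)                       # reliable links kept as-is
    for word, candidates in ug_candidates.items():
        scored = []
        for link in candidates:                 # link = (source word, target word)
            score = max(dict_general.get(link, 0.0), dict_specific.get(link, 0.0))
            if score > 0.0:
                scored.append((score, link))
        if scored:
            final.add(max(scored)[1])           # best dictionary-backed candidate
        # else: heuristic rules c) and d) of Figure 2 would apply here
    return final

if __name__ == "__main__":
    si = {("refer to", "target-1")}
    ug = {"multislice": [("multislice", "target-2"), ("multislice", "target-3")]}
    d_gen = {("refer to", "target-1"): 0.9}
    d_spec = {("multislice", "target-2"): 0.7}
    print(select_links(si, ug, d_gen, d_spec))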
Figure 3 lists four examples of word alignment adaptation. In example (1), the
phrase “based on” has two different alignment links: one aligns the whole phrase
“based on” and the other aligns only the word “based”. Since the translation
dictionary contains a translation for the phrase “based on”, the link for the whole
phrase is finally selected according to rule a) in Figure 2. In the same way, the link
for “contrast” in example (2) is selected with the translation dictionary. The link for
“reconstructed” in example (3) is obtained because three alignment links select it.
For the English word “x-ray” in example (4), we have two different links in UG: one
aligns “x-ray” with the single character “X” and the other aligns it with a longer
Chinese unit whose individual words have no other alignment links in the set WA.
According to rule d), we select the second link.
4 Evaluation
In this section, we compare our methods with three other methods. The first method
“Gen+Spec” directly combines the corpus in the general domain and in the specific
domain as training data. The second method “Gen” only uses the corpus in the general
domain as training data. The third method “Spec” only uses the domain-specific
corpus as training data. With these training data, the three methods can get their own
translation dictionaries. However, each of them can only get one translation
dictionary. Thus, only one of the two steps a) and b) in Figure 2 can be applied to
these methods. All of these three methods first get bi-directional statistical word
alignment using the GIZA++ tool, and then use the trained translation dictionary to
improve the statistical word alignment results. The difference between these three
methods and our method is that, for each source word, our method provides four
candidate alignment links while the other three methods only provide two candidate
alignment links. Thus, steps c) and d) in Figure 2 cannot be applied to these three
methods.
The training data and the testing data are the same as described in Section 2.1.
With the evaluation metrics described in section 2.2, we get the alignment results
shown in Table 3. From the results, it can be seen that our approach performs best
among these methods. Our method achieves a 21.96% relative error rate reduction as
compared with the method “Gen+Spec”. In addition, by comparing the results in
Table 3 with those in Table 1 in Section 2.2, we can see that the precision of word
alignment links is improved by using the translation dictionaries. Thus, introducing
the translation dictionaries improves alignment precision, while combining the
alignment results of “Gen” and “Spec” improves alignment recall.
In the testing set, there are 240 multi-word alignment links. Most of the links
consist of domain-specific words. Table 4 shows the results for multi-word alignment.
Our method achieves much higher recall than the other three methods and achieves
comparable precision. This indicates that combining the alignment results created by
the “Gen” method and the “Spec” method increases the possibility of obtaining multi-
word alignment links. From the table, it can also be seen that the “Spec” method
performs better than both the “Gen” method and the “Gen+Spec” method on the
multi-word alignment. This indicates that the “Spec” method can catch domain-
specific alignment links even when trained on the small-scale corpus. It also indicates
that by adding the domain-specific data into the general training data, the method
“Gen+Spec” cannot catch the domain-specific alignment links.
References
1. Ahrenberg, L., Merkel, M., Andersson, M.: A Simple Hybrid Aligner for Generating
Lexical Correspondences in Parallel Texts. In Proc. of the 36th Annual Meeting of the
Association for Computational Linguistics and the 17th Int. Conf. on Computational
Linguistics (ACL/COLING-1998) 29-35
2. Ahrenberg, L., Merkel, M., Hein, A.S., Tiedemann, J.: Evaluation of Word Alignment
Systems. In Proc. of the Second Int. Conf. on Linguistic Resources and Evaluation
(LREC-2000) 1255-1261
3. Brown, P.F., Della Pietra, S., Della Pietra, V., Mercer, R.: The Mathematics of Statistical
Machine Translation: Parameter estimation. Computational Linguistics (1993), Vol. 19,
No. 2, 263-311
4. Cherry, C., Lin, D.K.: A Probability Model to Improve Word Alignment. In Proc. of the
41st Annual Meeting of the Association for Computational Linguistics (ACL-2003) 88-95
5. Dunning, T.: Accurate Methods for the Statistics of Surprise and Coincidence.
Computational Linguistics (1993), Vol. 19, No. 1, 61-74
6. Ker, S.J., Chang, J.S.: A Class-based Approach to Word Alignment. Computational
Linguistics (1997), Vol. 23, No. 2, 313-343
7. Langlais, P.: Improving a General-Purpose Statistical Translation Engine by
Terminological Lexicons. In Proc. of the 2nd Int. Workshop on Computational
Terminology (COMPUTERM-2002) 1-7
8. Melamed, D.: Automatic Construction of Clean Broad-Coverage Translation Lexicons. In
Proc. of the 2nd Conf. of the Association for Machine Translation in the Americas
(AMTA-1996) 125-134
9. Menezes, A., Richardson, S.D.: A Best-First Alignment Algorithm for Automatic
Extraction of Transfer Mappings from Bilingual Corpora. In Proc. of the ACL 2001
Workshop on Data-Driven Methods in Machine Translation (2001) 39-46
10. Och, F.J., Ney, H.: Improved Statistical Alignment Models. In Proc. of the 38th Annual
Meeting of the Association for Computational Linguistics (ACL-2000) 440-447
11. Smadja, F., McKeown, K.R., Hatzivassiloglou, V.: Translating Collocations for Bilingual
Lexicons: a Statistical Approach. Computational Linguistics (1996), Vol. 22, No. 1, 1-38
12. Simard, M., Langlais, P.: Sub-sentential Exploitation of Translation Memories. In Proc. of
MT Summit VIII (2001) 335-339
13. Somers, H.: Review Article: Example-Based Machine Translation. Machine Translation
(1999), Vol. 14, No. 2, 113-157
14. Tufis, D., Barbu, A.M.: Lexical Token Alignment: Experiments, Results and Application.
In Proc. of the Third Int. Conf. on Language Resources and Evaluation (LREC-2002) 458-
465
15. Wu, D.K.: Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel
Corpora. Computational Linguistics (1997), Vol. 23, No. 3, 377-403
A Super-Function Based Japanese-Chinese
Machine Translation System for Business Users
1 Introduction
While the benefits derived from MT can be categorized into many classes,
one important class of real users consists of business users. For them, MT can
provide benefits in a number of ways. If one wants to understand the general
meaning of a document such as an e-mail or a business letter, MT can be used
to scan through such texts and return the requested information in a short
amount of time. It can also help users to screen many large documents in order
to identify documents that warrant more accurate translation.
With the increasing business between Japan and China, Japanese-Chinese
business letters are becoming more and more important. Here, we apply our approach
for these business users. Section 2 discusses the SFBMT approach and re-
lates it to other engines. Sections 3 and 4 describe the user requirements and
provide some experimental results. Finally, some conclusions are given.
2 Super-Function
A Super-Function is a function that shows some defined relations between a
source (original) language sentence and a target language sentence [9,15]. In the
sequel we discuss the SF concept and briefly put it into perspective with respect
to other approaches from the literature.
2.1 SF Definition
A SF can be represented using a formal description that relates a source and a
target language sentence pair. That is, a SF consists of the constant parts, which are
extracted from the source and the target language sentence pair, and the
corresponding positions of the variable parts.
Fig. 1. Example of DG
2.2 SF Architecture
A SF may be represented, e.g., by means of a Directed Graph (DG). In a DG
strings are represented by nodes and variables are represented by edges (see
Fig. 1 for the above example SF1). The purpose of introducing the DG is to use
a finite state technique in matching between a SF and a sentence.
Another architecture of a SF is a Transformation TaBle (TTB). It consists
of a Node TaBle (NTB) and an Edge TaBle (ETB). For SF1 the construction
of NTB and ETB is described using Tables 1 and 2. Language-J in NTB is a
string of Japanese, Language-C is a string of Chinese; Location-J and -C in ETB
indicate the location relationship between Japanese and Chinese. Kind in ETB
indicates the kind of variable or condition. At present, we use a TTB to represent
a SF. However, it is easy to transform a DG into a TTB and vice versa.
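As an illustration of this table representation, a SF can be stored as a node table of paired Japanese and Chinese constant strings plus an edge table recording the location of each variable in both languages and its kind. The field names and the placeholder content below are assumptions for the example, not the paper's exact tables.

# Illustrative sketch of a Super-Function stored as a transformation table.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SuperFunction:
    # NTB rows: (Language-J constant string, Language-C constant string)
    ntb: List[Tuple[str, str]]
    # ETB rows: (Location-J, Location-C, kind of variable or condition)
    etb: List[Tuple[int, int, str]] = field(default_factory=list)

# A toy SF for a sentence of the form "<noun> HA OMOTOSHITE <noun> TO <noun>
# NI HANBAISARETE ORIMASU"; the Chinese constants are placeholders.
sf1 = SuperFunction(
    ntb=[("HA OMOTOSHITE", "<C node 0>"),
         ("TO", "<C node 1>"),
         ("NI HANBAISARETE ORIMASU", "")],
    etb=[(0, 0, "noun"), (1, 1, "noun"), (2, 2, "noun")],
)
print(len(sf1.ntb), "constant nodes,", len(sf1.etb), "variable slots")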
2.3 Robust SF
At the beginning, we construct some Robust SF (RSF). Constructing RSFs aims
at enhancing the robustness of MT.
The SFBMT uses a bilingual dictionary and the SF concept to produce a trans-
lation of a source text. An input sentence is first analyzed morphologically and
then matched with the source sentence and a SF. SFBMT produces a sentence-
by-sentence translation of the source text, falling back on a phrase-by-phrase
and a word-by-word translation when no SF matches the input sentence. Here,
we just consider the case of sentence-by-sentence translation. The process of
SFBMT consists of three major parts:
Morphological Analysis: General morphological analysis was developed
as a method for structuring and investigating the total set of relationships con-
tained in multi-dimensional, non-quantifiable, problem complexes [16]. Japanese
text poses challenges for automatic analysis due to the language’s complex con-
structions. Text must be accurately segmented before common application functions
like parsing, indexing, categorizing, or searching can be applied. Morphological analysis is often
a prerequisite to performing operations on text of any language.
Super-Function Matching: A SF is represented by a NTB and an ETB.
Matching a SF is simply matching each node of the NTB and confirming the
kind of each edge in the ETB.
Morphological Translation: After obtaining the SF by the SF matching,
the target language variable parts are rearranged, according to the location rela-
tionship in the ETB, into the string parts to get the translated sentence. Usually,
it is suggested that an unknown word is treated as a noun to match a SF, and a
translation is generated using the string of the unknown source language word.
However, for Japanese as a specific language, we define some rules to decide if
an unknown word is a verb or a noun.
For each common type of business letter, the basic format and the
outline are very similar. Based on the special functionalities and properties of
business letters, in our system we only treat nouns as variables for translation.
The outline of the translation is as follows (see also Fig. 2).
The translation system consists of five major parts:
1. The input Japanese sentence is written to a file.
2. Morphological Analysis: The Japanese sentence is morphologically analyzed
by ChaSen (a free Japanese morphological analyzer). ChaSen analyzes the
specified file into morphemes and outputs the result.
3. Translation Processing: First of all, the nouns are extracted from the file
which resulted after the morphological analysis. Then the words between
nouns are tied together to build a node of a SF. The nouns are written
into a noun file, and the parts of the SF are written to a node file which is
matched with the SF base to search for the corresponding Chinese SF part.
By using the bilingual dictionary the nouns are then translated into Chinese.
4. Morphological Agreement: Based on the order of nouns in an ETB a rear-
rangement of the nouns within the Chinese node parts takes place.
5. The translated sentence is output to a browser.
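The sketch below is a toy version of steps 3 and 4, with the format of the SF, the noun dictionary, and the Chinese placeholders all being illustrative assumptions rather than the real SFBMT code: the constant nodes of a matching SF are located in the input, the nouns in between are collected and translated with the bilingual dictionary, and the translations are rearranged into the Chinese node parts according to the ETB locations.

# Toy sketch of SF matching and noun rearrangement.
def match_and_translate(tokens, src_nodes, tgt_nodes, etb, noun_dict):
    """tokens: source sentence tokens; src/tgt_nodes: constant parts;
    etb: (Location-J, Location-C, kind) rows; noun_dict: bilingual nouns."""
    variables, i = [], 0
    for node in src_nodes:                       # collect the noun before each node
        node_tokens = node.split()
        j = i
        while j < len(tokens) and tokens[j:j + len(node_tokens)] != node_tokens:
            j += 1
        if node_tokens and j >= len(tokens):
            return None                          # this SF does not match the input
        variables.append(" ".join(tokens[i:j]))
        i = j + len(node_tokens)
    translated = [noun_dict.get(v, v) for v in variables]
    out = []
    for loc_j, loc_c, _kind in sorted(etb, key=lambda row: row[1]):
        out.append(translated[loc_j])            # variable, in Chinese order
        out.append(tgt_nodes[loc_c])             # following Chinese constant part
    return " ".join(piece for piece in out if piece)

if __name__ == "__main__":
    sentence = "SEIHIN HA OMOTOSHITE DOITSU TO BEIKOKU NI HANBAISARETE ORIMASU"
    src_nodes = ["HA OMOTOSHITE", "TO", "NI HANBAISARETE ORIMASU"]
    tgt_nodes = ["<C node 0>", "<C node 1>", ""]
    etb = [(0, 0, "noun"), (1, 1, "noun"), (2, 2, "noun")]
    nouns = {"SEIHIN": "<products>", "DOITSU": "<Germany>", "BEIKOKU": "<USA>"}
    print(match_and_translate(sentence.split(), src_nodes, tgt_nodes, etb, nouns))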
4.1 Example
In the practical experiment, we unite the NTB and the ETB to express the SF.
We use the above example SF1 to explain the detailed translation process.
Japanese: SEIHIN HA OMOTOSHITE DOITSU TO BEIKOKU NI
HANBAISARETE ORIMASU.
(Perform morphological analysis)
(Tie words between nouns together and construct a Japanese SF part from
them)
(Match the Japanese SF part with the SF base and output the corresponding
Chinese SF part)
(Rearrange the SF part and translated nouns, if necessary)
References
1. R.D. Brown. Adding linguistic knowledge to a lexical example-based translation
system. In Proceedings of the 8th International Conference on Theoretical and
Methodological Issues in Machine Translation, pages 22–32. Chester, 1999.
2. R.D. Brown. Automated generalization of translation examples. In Proceedings of
the 18th International Conference on Computational Linguistics (COLING-2000),
pages 125–131. Saarbrücken, 2000.
3. R. Frederking, D. Grannes, P. Cousseau, and S. Nirenburg. An MAT tool and its
effectiveness. In Proceedings of the DARPA Human Language Technology Work-
shop, Princeton, NJ, 1993.
4. K. Fujimoto, L. Zhang, and H. ShiYun. Chinese-Japanese Business Letters Ency-
clopedia. TOHU Shoten, 1995.
5. J. Hutchins. Machine translation today and tomorrow. In G. Willée, B. Schröder,
and H.-C. Schmitz, editors, Computerlinguistik: Was geht, was kommt?, pages
159–162, Sankt Augustin, 2002. Gardez.
6. http://www.jusnet.co.jp/business/bunrei.html. June 2004 (date of last check).
7. K. Maruyama, M. Doi, Y. Iguchi, K. Kuwabara, M. Onuma, T. Yasui, and R. Yoko-
suka. Writing Business Letters In Japanese. Original Japanese edition published
by The Japan Times, Ltd., 2003.
8. S. Nirenburg, S. Beale, and C. Domashnev. A full-text experiment in example-
based machine translation. In Proceedings of the International Conference on New
Methods in Language Processing, pages 78–87. Manchester, 1994.
9. F. Ren. Super-function based machine translation. Communications of COLIPS,
9(1):83–100, 1999.
10. S. Sato. MBT2: A method for combining fragments of examples in example-based
translation. Artificial Intelligence, 75:31–49, 1995.
11. K. Takeda. Pattern-based context-free grammars for machine translation. In Pro-
ceedings of the 34th Annual Meeting of the Association for Computational Linguis-
tics, pages 144–151, 1996.
12. T. Veale and A. Way. Gaijin: A bootstrapping, template-driven approach to
example-based MT. In Proceedings of the NeMNLP ’97 New Methods in Natu-
ral Language Processing. Sofia, 1997.
13. H. Watanabe. A method for extracting translation patterns from translation ex-
amples. In Proceedings TMI-93, pages 292–301, 1993.
14. R. Zajac and M. Vanni. Glossary-based MT engines in a multilingual analyst’s
workstation architecture. Machine Translation, 12:131–151, 1997.
15. X. Zhao, F. Ren, S. Kuroiwa, and M. Sasayama. Japanese-Chinese machine trans-
lation system using SFBMT. In Proceedings of the Second International Conference
on Information, pages 16–21, Beijing, 2002. Tsinghua University.
16. F. Zwicky. Discovery, Invention, Research – Through the Morphological Approach.
Macmillan, Toronto, 1969.