M. Kugler K. Ahmad
G. Thurmair (Eds.)
Translator's Workbench
Tools and Terminology for
Translation and Text Processing
Springer
Volume Editors
Marianne Kugler
Philips Kommunikations-Industrie AG
Thurn-und-Taxisstr. 10, D-90327 Nürnberg, Germany
Khurshid Ahmad
AI Group, Department of Mathematical and Computing Sciences
University of Surrey, Guildford, Surrey GU2 5XH, UK
Gregor Thurmair
Sietec Systemtechnik
Carl-Wery-Str. 22, D-81739 München, Germany
ISBN-13: 978-3-540-57645-7
DOI: 10.1007/978-3-642-78784-3
e-ISBN-13: 978-3-642-78784-3
SPIN: 10132150
Preface
The Translator's Workbench Project was a European Community sponsored research and
development project which dealt with issues in multi-lingual communication and documentation.
This book presents an integrated toolset as a solution to problems in translation and documentation. Professional translators and teachers of translation were involved in the process of software development, starting with a detailed study of the user requirements and
ending with several evaluation-and-improvement cycles of the resulting toolset.
English, German, Greek, and Spanish are addressed in the contributions; however, some
of the techniques are inherently language-independent and can thus be extended to cover
other languages as well.
Translation can be viewed broadly as the execution of three cognitive processes, and this
book has been structured along these lines:
First, the translation pre-process: understanding the target language text at a lexico-semantic level on the one hand, and making sense of the source language document on
the other. The tools for the pre-translation process include access to electronic
networks, conversion of documents from one format to another, creation of terminology data banks and access to existing data banks, and terminology dictionaries.
Second, the translation process: rendering sentences in the source language into equivalent target sentences. The translation process draws on conventional
machine translation systems, like METAL, and on the statistically oriented translation
memory.
Third, the translation post-process: making the target language text readable at the lexical,
syntactic and semantic levels. The post-translation discussion covers
computer-based solutions to proof-reading: spelling checkers and grammar checkers for
English, German, Greek, and Spanish.
The Translator's Workbench comprises tools that make these three cognitive processes
easier for the translator to execute, and it surveys the state of the art regarding which translation and
text processing tools are available or feasible.
The Translator's Workbench Project with its interdisciplinary approach was a demonstration of the techniques and tools necessary for machine-assisted translation and documentation.
Marianne Kugler, Khurshid Ahmad, Gregor Thurmair
December 1994
List of Authors
Ahmad, Khurshid, University of Surrey
Albl, Michaela, University of Heidelberg
Davies, Andrea, University of Surrey
Delgado, Jaime, UPC Barcelona
Dudda, Friedrich, TA Triumph Adler AG
Fulford, Heather, University of Surrey
Hellwig, Peter, University of Heidelberg
Heyer, Gerhard, TA Triumph Adler AG
Hoge, Monika, Mercedes Benz AG
Hohmann, Andrea, Mercedes Benz AG
Holmes-Higgin, Paul, University of Surrey
Hoof, Toon van, Fraunhofer Gesellschaft IAT
Karamanlakis, Stratis, L-Cube Athens
Keck, Bernhard, Fraunhofer Gesellschaft IAT
Kese, Ralf, TA Triumph Adler AG
Kleist-Retzow, Beate von, TA Triumph Adler AG
Kohn, Kurt, University of Heidelberg
Kugler, Marianne, TA Triumph Adler AG
Le Hong, Khai, Mercedes Benz AG
Mayer, Renate, Fraunhofer Gesellschaft IAT
Menzel, Cornelia, Fraunhofer Gesellschaft IAT
Meya, Montserrat, Siemens Nixdorf CDS Barcelona
Rogers, Margaret, University of Surrey
Stallwitz, Gabriele, TA Triumph Adler AG
Thurmair, Gregor, Siemens Nixdorf Munich
Waldhör, Klemens, TA Triumph Adler AG
Winkelmann, Günter, TA Triumph Adler AG
Zavras, Alexios, L-Cube
Table of Contents
I. Introduction - Multilingual Documentation and Communication ............................ 1
1. Introduction ................................................................................................................... 3
2. Key Players ................................................................................................................... 4
3. The Cognitive Basis of Translation .............................................................................. 6
4. User Participation in Software Development ............................................................... 8
4.1 User Requirements Study ............................................................................ 9
4.2 Software Testing and Evaluation - Integrating the User into the Software Development Process .................. 14
5. TWB in the Documentation Context .......................................................................... 16
5.1 The Context of Translation Tools .............................................................. 16
5.2 Text Control .............................................................................................. 18
5.3 Translation Preparation Tools ................................................................... 20
5.4 Translation Tools ....................................................................................... 22
5.5 Post-Translation Tools ............................................................................... 23
6. Market Trends for Text Processing Tools ................................................................... 24
I. Introduction
Multilingual Documentation and Communication
1. Introduction
Gerhard Heyer
Text processing has always been one of the main productivity tools. As the market is
expected to approach saturation, however, two main consequences need to be taken into
consideration. On the one hand, we can expect a shift of interest towards value-adding text
processing tools that give the user an operational advantage with respect to, e.g. translation support, multi-media integration, text retrieval, and groupware. On the other hand,
any such extensions will gain widest acceptance if they can be used as complements to or
from within the most widespread text processing systems.
The project has tried to take both considerations into account by developing TWB as a set
of modular tools that can equally be integrated into FrameMaker under UNIX, or into Word and
other text processing packages for Windows (capable of communicating via DDE).
Three scenarios are expected as real life applications:
(1) Writing a text or memo in a language that is not one's native language. In a Europe that
is growing ever more closely together, texts written in English, French, or German will increasingly
have to be exchanged. For many colleagues in international organisations
or corporations, this is already a reality today. Requirements for this kind of application are
in particular language checking tools for non-native writers of a language, preferably
based on a comparative grammar.
(2) Translating a text from a source language into a target language. In this scenario, it is
mainly professional translators who are concerned, but authors in all areas of text production also occasionally have to translate existing documents into other languages. In
a professional environment, translation is frequently part of a technical documentation
process. Hence, requirements for this kind of application comprise powerful editors, text
converters, document management, terminology administration, and translation support
tools, including fully automatic translation.
(3) Reviewing a text and its translation. Once a text has been translated, either the translator himself or the person for whom the text was translated might wish to review the
translation with respect to the source text. In many cases, this will pertain to just a few key
terms or paragraphs. The main requirement for this kind of application, therefore, is parallel scrolling, or alignment.
TRANSLATOR'S WORKBENCH (TWB), as a pre-competitive demonstrator project, covers all three scenarios, and is intended to deliver up-to-date and competitive results that
adhere to international standards and can serve as a basis for future developments
in the area of advanced text processing and machine-aided translation.
2. Key Players
Monika Hoge, Andrea Hohmann, Khai Le-Hong
Traditionally, translators mediate between document creators on the one hand and document users on the other. However, due to the
recent growth of multilingual communication and documentation, both document creators and document users are very often confronted with the problem of producing and
understanding foreign language texts. The need to support all three groups - document creators, translators, document users - with adequate tools is obvious, and the marketplace is ever
increasing.
One can distinguish between two major potential user groups of language tools, i.e. professional users and occasional users. Professional users include in-house translators, translators in translation agencies, freelance translators, and interpreters. There is an even wider
range of occasional users, i.e. commercial correspondents, managers, executives, technical
writers, secretaries, typists, research staff, and publishing shops. Different users have different requirements, have to perform different tasks and consequently need different tools.
Figure 1 gives an outline of the different user groups and the language-related tasks they
have to perform.
Fig. 1: User groups (occasional and professional users) and the language-related tasks they have to perform, e.g. typing texts in different languages, understanding source texts, writing target and foreign-language texts, knowledge extraction, terminology elaboration, translating (intellectual), checking texts, and looking up LGP and LSP terms

Fig. 2: User groups (occasional and professional users) and the tools they require, e.g. editors, LSP term banks, translation tools, machine translation, text comparison, converters, dictionaries, a terminology elaboration tool, language checkers, and on-line aids
sary for securing a satisfactory match with the available 'top down' information,
translating requires their presence throughout the entire process. The words and structures
of the source text do not fade away with successful comprehension; in fact, they are kept
alive and are needed for continual checks. This means that in the course of the production
of the target text, translators have to create appropriate expressions while, at the same
time, being forced to keep their attention on the source text. Thus, "old" textual clues relevant for the comprehension task get in the way of "new" ones necessary for production. This interference conflict between source and target text accounts for
"translationese", which is one of the most persistent translation problems.
The second type of conflict derives from the translator's lack of communicative autonomy.
Translators are normally more or less limited in the semantic control they are allowed to
exercise. Unlike the ordinary unilingual producer of texts, they are not semantically autonomous. The meaning for which the translator has to find the appropriate linguistic expression is essentially determined by the source text. Translating conditions are thus in
conflict with the meaning-creating function of text production, and seriously inhibit the
translators' intuitive retrieval of their linguistic knowledge; again, translationese is likely
to be the unwelcome result.
Underlying this model of translating as a text-based cognitive activity is the claim that the
strategies and procedural conditions of translational text processing are basically the same
for translating general and special texts. This does not mean, however, that there are no
differences between the two. On the contrary, depending on the kinds of factual and linguistic knowledge involved, and of the conventions controlling the adequacy of textual
expressions, texts of different types create specific demands on translational processing.
In the case of special texts, more often than not, the common problems resulting from
translation-specific processing constraints are magnified by the difficulties translators have
in keeping abreast of the explosive growth of knowledge in specialized
subject areas, and of the corresponding changes in terminological, phraseological and stylistic standards and conventions. Technical translators are not only faced with the "ordinary" problems of translation; they also have to fight a constant battle on the knowledge
front.
The solutions provided by TWB address a whole variety of needs arising in the context of multilingual documentation and translation: target text checking with respect to
spelling mistakes, grammatical errors and stylistic inadequacies; the acquisition, representation and translational deployment of terminological knowledge; the encyclopaedic and
transfer-specific elaboration of terms; and a dynamically evolving translation memory. In the
face of a fast-moving multilingual information market, TWB is thus geared to supporting
translators in achieving better and more reliable translation results and in increasing their overall translational productivity.
4. User Participation in Software Development
Khurshid Ahmad, Andrea Davies, Heather Fulford, Paul Holmes-Higgin, Margaret Rogers,
Monika Hoge
In the last 40 years, translation has established itself as a "distinct and autonomous profession", requiring "specialised knowledge and long and intensive academic preparation"
(Newmark 1991:45). Even in the United Kingdom, for example, where industry and commerce often rely on the status of English as a lingua franca, the profession is now consolidating its position.
Fig. 3: Challenges facing the translation industry and the resulting consequences/needs, e.g. a growing awareness of product quality, involving extensive translation quality assurance (the target text has to meet the pragmatic, stylistic and terminological demands of the user/reader/consumer), and, owing to the harmonization of the Single European Market in 1993 and the resulting problem of product liability, the growing importance of the accuracy and clarity of translations
The Institute of Linguists has 6,000 members, of whom nearly 40% belong to the Translating Division and the Interpreting Division (1991). The more recently established Institute
of Translation and Interpreting has nearly 1,500 members (1991). Both organisations offer
examinations in translation. In the TWB project, the professional status and expertise of
translators has been explicitly acknowledged by their close involvement in all stages of
software development through the integration of an IT user organisation, i.e. Mercedes-Benz AG.
Mercedes-Benz translators have been involved throughout the software design and development life-cycle, initially in a user requirements study, and subsequently in software
evaluation during the project. The project consortium recognized the importance of eliciting the translators' needs before developing tools to support them in their daily work. For
this reason, at the beginning of the TWB project a user requirements study was conducted
by Mercedes-Benz AG and the University of Surrey. The purpose of this study was to
elicit, through a questionnaire-based survey, observation study, and in-depth interviews,
the current working practice of professional translators, and to ascertain how they might
benefit from some of the recent advances in information technology. This study is perhaps
one of the first systematic attempts to investigate the translation world from these two perspectives. The study methodology itself was innovative, drawing on and synthesizing
techniques employed in psychology, software engineering, and knowledge acquisition for
expert system development.
Software development and its implementation precipitate change in the user organisation,
which leads to an inevitable change in the original user requirements. This had to be
incorporated in our work, as we view software development as an evolving process in which
history is as important as future development; to this end, translators were consulted
throughout the project and have taken part in software testing and evaluation (see Chap. 20
for details).
The integration of the end users into the project development work has necessitated careful change management, with a recognition of the need to introduce new technology without disturbing unduly the established working practice of translators. The user
requirements study substantiated the theoretical literature on translation (see for example
Newmark 1988 and 1991) by confirming that translation is a complex cognitive task
involving a number of seemingly disparate high-level skills. These skills could perhaps be
grouped under the broad heading of multilingual knowledge communication skills, and
involve essentially an ability to transfer the meaning and intention of a message from one
language to another in a manner and style appropriate to the document type. The skills fall
into the following categories: linguistic (e.g. language knowledge, ability to write coherently and cohesively), general communication (e.g. sensitivity to and understanding of the
styles of various documents, knowing where to locate relevant domain information, knowing how to conduct library searches), and practical abilities (e.g. word processing, typing,
use of a fax machine, familiarity with electronic mail, as well as miscellaneous administrative tasks.)
In the following, the methodology adopted for the user requirements study and some of its principal findings are presented.
Finally, the relevance of these results to TWB software development is discussed.
Fig. 4: Translator profile according to age, sex, translation qualification, and principal subjects translated
Figure 3 illustrates some of the major challenges facing the translation industry of the
future, and indicates the technological facilities to which translators should have access if
these challenges (identified by the Language Services Department at Mercedes-Benz AG)
are to be adequately faced.
Questionnaire Survey
The questionnaire survey was based on a "task" model of translation, i.e. the translation of
documents was viewed as a combination of tasks, namely: the task of receiving the documents to be translated (input), the task of translating the document (processing), and the
task of delivering the document (output). The task model provided clues to what it is the
translator generally does and what particular needs the individual translator has, permitting the formation of a user profile. The task model and the user profile formed the basis
of our study and enabled us to establish a clear picture of the translator and his or her
working environment.
Observation Study
The observation study was conducted by the University of Surrey in the Mercedes-Benz
AG translation department. Six translators were observed at work (without disrupting
their normal work routine, although some did stop work to discuss various aspects of
translation practice). Our primary aim was to gain an overall impression of the translation
process and, in so doing, to identify some of the problems translators encounter in their
daily work.
An additional aim of the observation study was to watch translators in various phases of
their work, i.e. the pre-translation edit, translation phase and post-translation edit phase, to
gain insight into the procedure adopted by translators in their work and the phases in
which various tools, such as terminology aids, are used.
In-depth interviews
In-depth interviews were conducted with some translators in the UK and in the translation
department at Mercedes-Benz. The objective of this part of our study was to gain more
detailed information about translation practice and translators' requirements by allowing
translators to discuss their work and the needs they have. This phase of our study was particularly useful for discussing issues which were not really suitable for presentation in the
questionnaire, such as the layout of the user interface.
Some of the techniques of knowledge engineering were employed for the interviews with
translators. Two forms of interviews were conducted: focussed and structured interviews.
Focussed interviews provided the translators with the opportunity to discuss their work
and their requirements freely; structured interviews enabled the interviewer to pose specific questions to the translators about current working practice and requirements.
The principal topics covered in interviews included translators' working methods, terminology requirements, the use of computer checking tools, and the layout of the user interface. In the discussions on the layout of the user interface, a series of storyboards was used
to enable translators to visualise screen layout options, and so on.
4.1.3 User Requirements Study: Principal Findings
People: Figure 4 indicates that the typical translator in our survey is young, female, has a
university qualification, and translates technical texts.
"Inputs" and "Outputs": the major input media are non-digital, whereas the principal output media are digital. This indicates that translators make use of digital technology (Le.
computers), but may be constrained by their client's use of non-digital input media.
Processing Requirements: We investigated processing requirements for (i) terminology,
(ii) word processing, (iii) spelling checking, (iv) grammar checking, and (v) style checking. The spectrum of user needs for processing tools ranged from simple look-up lexical
items to more complex semantic processing. Our respondents were only aware of computer-based facilities for items (i), (ii) and (iii) above. For text processing, word processors
are used by a substantial proportion of translators. The use of dictating machines was
found to be rare in our sample. Most translators in our sample (over 40%) checked spelling off-line, approximately 30% manually on screen, and just over 25% using spell check
programs. Less than 1% of the translators in our sample had any direct experience with
term banks or with machine translation or machine-aided translation.
The terminology aids most commonly used by the translators in the survey sample were
paper-based dictionaries, glossaries and word lists, although doubts were expressed about
their accuracy and currency. Translators in our sample organised their terminology bilingually and alphabetically in a card index, including less information than they expect from
other sources. Only very few translators organised their terminology systematically (e.g.
according to a library classification), but many stated that this could be useful. Grouped in
order of priority, requirements for terminological information were found to be:
foreign language equivalent, synonym, variant, abbreviation;
definition, contextual example, usage information;
date, source and terminologist's name;
grammatical information.
Our survey indicated that terminological information is required throughout the translation
process: during pre-translation editing to clarify the source language terms; in the translation process itself to identify and use target language terms; and during the post-translation
edit phase for checking the accuracy of target language terms.
Observation Study
The chronology of translation tasks was shown to be: read source language text; mark
unfamiliar terminology; consult reference works; translate text; edit translation. In addition, reference works continued to be consulted throughout the translation process to clarify further terminological problems. Hence, our study indicates that the translation process
cannot be discretely divided into the three phases (pre-translation edit, translation phase
and post-translation edit) commonly assumed in translation theory and in the teaching of
translation.
The principal difficulty identified by translators throughout their work was the inadequacy
of currently available reference material in terms of currency, degree of domain specialisation, and range of linguistic detail.
Interview Study
The interviews confirmed that additional aids would be welcome in the translation environment. These aids should support
the translator throughout the translation process, and include tools to assist in terminology
elicitation, term bank development, terminology retrieval, multilingual text processing,
the provision of machine-produced raw translations, the identification of previous translations, and spelling checking.
Based on the results of the user requirements study, the Translator's Workbench is providing the following:
4.2 Software Testing and Evaluation - Integrating the User into the
Software Development Process
This section considers some of the problems involved in software testing and evaluation
during development in the particular context of TWB and presents the solutions which
were agreed on.
Integration Considerations
In trying to integrate the end user into the software development process, we had to
address the issue of communication difficulties between software developers and users.
Owing to their general lack of computing expertise, it is difficult for the users to formulate
and articulate their specific requirements in a way that the software developers can comprehend and act upon. Likewise, the software developers typically find it difficult to grasp
the real issues involved in the day-to-day tasks of a translation environment. In order to
bridge the gap between these two groups, a team was formed from members of the consortium (The User Requirements and Interface Group - URI). This group, made up of users,
developers, and linguists, has been responsible for considering the points raised by users in
the software testing and evaluation, and for communicating these points to the software
developers.
Organisational Considerations
Software acceptance tests are usually conducted upon completion of the implementation
phase, but in TWB, software testing took place with active user participation throughout
the software development lifecycle. Adopting this approach to testing improved the
chances of detecting major deficiencies in the software and of determining any significant
discrepancies between the users' expectations of the software and the software's performance or
functionality. The TWB software development and evaluation work has therefore been a
dynamic process. The approach has, however, put great demands on the software developers because they have been compelled to deliver testable software at a very early stage in
the project. This was particularly difficult within the scope of TWB, since it is a pre-competitive research and development project. Nevertheless, three scenario tests have been
successfully carried out on separate prototype versions of the TWB software; and furthermore, long-term tests have been conducted, in which a member of the testing team has
inspected the functionality and performance of the software over a longer period of time.
The general testing procedure developed by the URI group follows a nine-step approach
in which developers and users participate equally. This approach is illustrated in Figure 5
below.
If the above testing procedure is followed, the software can be improved at any point in
the development cycle, and in turn, the improvements can be monitored by the testing
team. This has the advantage that the user is motivated by the fact that the problems identified during testing are generally solved before the next phase of tests is conducted.
Methodological Considerations
Eliciting user requirements at the beginning of the software development process is only
the first stage of user participation. The later, and arguably more complex, stage of software testing and evaluation involves the assessment of how far translators' actual requirements have been met. A particular difficulty here is that users' requirements often change
once the tools which have been designed to meet specified needs are in operation. Hence,
the testing and evaluation of final software quality involves more than a simple check
against pre-defined requirements. The notion of quality from the user's point of view
must be defined, and methods elaborated for the metrication of this quality by means of
acceptance tests (for details see Chap. 20).
5. TWB in the Documentation Context
Gregor Thurmair
projects for "multilingual offices" which tackle problems of language use in the office
environment. These projects could serve as a basis to develop tools particularly designed
for translators and technical authors such as accounting, delta version management, etc.
Some of the tools needed have been studied in more detail in Translator's Workbench
(1WB).
a lexicon-based speller for Greek has been developed, based on conventional techniques of lexicon lookup and similarity measuring. The challenge here was the language: Greek is a highly inflectional language with many irregularities and word stem
changes, therefore the lexicon lookup needs to be organised in a sophisticated way.
in order to create spellers with more linguistic intelligence, a Spanish spelling module
has been developed which operates on the syllabic structures of Spanish. It turns out
that this approach is more accurate than existing spellers, both in terms of diagnostics
and of correction proposals.
most spellers are not context-sensitive, i.e. they do not recognise errors which lead to
"legal" words such as agreement errors, fixed phrases etc. In TWB, an extended speller
has been developed for German, which not only recognises more difficult cases of phraseology (e.g. German capitalisation problems) but also incorporates intelligence to recognise some basic agreement errors (e.g. within noun phrases, noun - verb congruency).
This shows that spelling correction requires more linguistic intelligence than a mere
lexicon lookup.
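To make the basic mechanism concrete, the lexicon lookup and similarity measuring mentioned above can be sketched in a few lines (an illustrative sketch only, not the TWB implementation; the word list, the edit-distance threshold, and the ranking are invented for the example):

```python
# Minimal sketch of lexicon lookup with similarity-based correction proposals.
# The lexicon and the distance threshold are illustrative only.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def check_word(word: str, lexicon: set, max_dist: int = 2) -> list:
    """Return correction proposals for an unknown word, ranked by similarity."""
    if word.lower() in lexicon:
        return []                                        # word is known, nothing to report
    candidates = [(edit_distance(word.lower(), w), w) for w in lexicon]
    return [w for d, w in sorted(candidates) if d <= max_dist]

lexicon = {"translation", "terminology", "workbench"}
print(check_word("terminlogy", lexicon))                 # ['terminology']
```

A context-free checker of this kind flags only non-words; the more "intelligent" spellers discussed here go beyond it by exploiting morphological and syntactic knowledge.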
The result of these efforts was a Greek speller released as a product for DOS, and
improved quality for spellers with more linguistic intelligence. Two issues remain, however: the spellers must be partially redesigned from a software engineering point of view to
compete with the existing ones, and they must be extended in terms of language coverage.
This is a problem for most of the TWB tools: as they involve more and more linguistic
intelligence, porting them to other languages requires considerable effort, not just in
terms of lexicon replacements but also in terms of studying the syllabic structures of European languages in more detail, or even developing grammars. It may turn out that an
approach which works for one language does not work for another. This is a drawback
from a product marketing point of view, as several parallel approaches must be developed
and maintained.
In the case of grammar checking, it turns out that most existing grammar checkers are
not reliable, and are therefore very restricted in their usage. This is due to the fact that
most of them do not use a real grammar but are based on more or less sophisticated
pattern matching techniques. However, the fact that they sell shows that there is a need for
such tools.
TWB again followed several approaches in grammar checking:
We developed a small noun phrase grammar based on augmented transition networks
(ATN) for German in order to detect agreement errors. An analysis of the errors made
by foreign language students indicated that these noun phrase agreement errors are
among the most frequent grammatical errors in German texts (please see the section on
Grammar and Style Checkers for more detailed information). This grammar is planned
to run on a DOS machine, as part of the extended speller mentioned above. The problem with partial parsing is, of course, finding the segments at which to start parsing (noun
phrases can have embedded clauses, prepositional attachments, etc.).
A second approach has been followed for Spanish grammar checking: here we used an
existing grammar (the METAL analysis) and enriched it with special rules and procedures to cover ungrammatical input. During parsing, it can be detected whether one of those
special rules has fired, and if so, the appropriate diagnostic measure can be taken.
This approach adds a "peripheral" grammar to the core grammar which tries to identify
the cases of ungrammaticality (agreement errors, wrong verb argument usage, etc.). Its
success depends on two factors, however: first, the grammar writer must have foreseen
the most frequent types of errors in order to allow the grammar to react to them; and
second, the coverage of the core grammar must be good enough not to judge a
parse failure as an ungrammatical sentence.
A third approach was developed for German again: Based on a dependency grammar,
we tried to implement an approach called "unification failure". The basic idea is that
the grammar itself should decide what an ungrammatical input may be and where an
error could be detected (namely where two constituents cannot be unified into a bigger
one).
This approach is backed by the study carried out for German mentioned above, which
shows that nearly all kinds of errors one could think of can really be found in a text corpus; it may therefore be difficult to predict those errors in a "peripheral" grammar
approach.
The basic algorithms for the "unification failure" approach have been developed and
implemented; some theoretical problems still have to be solved; see the chapter on syntax checking below.
As a result, it turned out that grammar checking needs much more linguistic intelligence if
it is to be helpful and reliable. It requires fully developed lexicon and syntax components
and some "heavy" machinery (in terms of computing power). The TWB tools perform better
the more developed the underlying grammars are. This again hampers their portability to
other languages, as it means considerable investment.
A last area of language checking was style checking, or more specifically the verification of
controlled grammars. This is more closely related to the documentation business, as it tries
to implement guidelines for good technical writing and conventions for style and layout,
which also imply language criteria.
Such a verification of controlled grammars has been developed for German, based on the
METAL grammar. It implements diagnostics for guidelines such as
Don't write sentences longer than 20 words
Don't use too complex sentences
Don't use compounds of three or more parts
It analyses texts sentence by sentence and gives diagnostic information for each of them if
necessary.
This approach seems to be feasible if the grammar coverage is large enough. Several documentation departments are experimenting with it; it still has to be extended to other languages.
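In a greatly simplified form, guideline checking of this kind can be pictured as follows (a sketch only: the real verification is based on the METAL grammar, whereas this example uses crude surface heuristics, and the compound test is an invented proxy):

```python
# Crude sketch of controlled-language checking on surface text.
# The 20-word limit comes from the guideline quoted above; the compound
# heuristic (very long capitalised words) is purely illustrative.
import re

MAX_WORDS = 20
MAX_COMPOUND_LENGTH = 20   # invented proxy for "three or more compound parts"

def check_sentence(sentence: str) -> list:
    diagnostics = []
    words = sentence.split()
    if len(words) > MAX_WORDS:
        diagnostics.append(f"sentence has {len(words)} words (limit {MAX_WORDS})")
    for word in words:
        if word[:1].isupper() and len(word) > MAX_COMPOUND_LENGTH:
            diagnostics.append(f"'{word}' looks like an overlong compound")
    return diagnostics

def check_text(text: str) -> None:
    # Naive sentence split; a real checker would use the parser's segmentation.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for message in check_sentence(sentence):
            print(f"{message}: {sentence[:40]}...")

check_text("Die Kraftfahrzeughaftpflichtversicherung wird angepasst.")
```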
The term bank is used in two ways: It is a medium to store and edit terminology, and it is a
medium to retrieve terminology during the translation process. As a basic software device,
an Oracle relational database was chosen; this was integrated both into the MATE terminology toolkit and into the term retrieval component. The structure of a terminological
entry has been defined in a "multilayer" approach, and several thousand example entries
(for the area of automotive engineering) have been implemented.
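As an illustration of what storage and retrieval in such a term bank might look like (a hypothetical sketch: the table layout and column names are invented and do not reproduce the actual multilayer entry structure, and SQLite stands in here for the Oracle database used in the project):

```python
# Hypothetical sketch of a relational term bank; a real "multilayer" entry
# would spread over several linked tables, one table keeps the example short.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE term_entry (
        concept_id   INTEGER,
        language     TEXT,
        term         TEXT,
        definition   TEXT,
        subject_area TEXT
    )
""")
conn.executemany(
    "INSERT INTO term_entry VALUES (?, ?, ?, ?, ?)",
    [
        (1, "de", "Zylinderkopf", "Oberer Abschluss des Zylinders", "automotive"),
        (1, "en", "cylinder head", "Upper closure of the cylinder", "automotive"),
    ],
)

def lookup(term: str, target_language: str) -> list:
    """Find target-language equivalents of a source-language term."""
    rows = conn.execute(
        """SELECT t2.term FROM term_entry t1
           JOIN term_entry t2 ON t1.concept_id = t2.concept_id
           WHERE t1.term = ? AND t2.language = ?""",
        (term, target_language),
    )
    return [row[0] for row in rows]

print(lookup("Zylinderkopf", "en"))   # ['cylinder head']
```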
During the translation process, access to these data must be provided from the translator's
wordprocessor. The TWB tools offer several possibilities here:
The easiest way is to use the Cut and Paste facility offered by the UNIX/Motif window manager. Users simply highlight a text portion, paste it into the search window of
the term bank, retrieve the result, and paste it back into the text. Although this approach
works in general, it has problems with formatting characters, blank spaces, and so
on. Also, it is not the fastest way to do it.
On DOS, it is possible to ask for lexicon information (the area being commerce,
finance, and law) from the WinWord editor, by using a hotkey and activating internal
links. In this way, translators can look up and paste terms into their document.
On UNIX, this has been achieved by implementing a special interface into the FrameMaker desktop publishing system. This interface has also proved to be suitable for
other linguistic applications.
The success of using a term bank depends on the terminology which is stored in it; this is
not a software issue but an issue of providing terminology in different areas: If it is too
expensive to fill a term bank, or if users do not find what they search for, it will not be
used.
Therefore it is essential to provide tools for terminology maintenance. TWB has developed software for terminology maintenance and corpus work: the MATE system comprises corpus analysis tools (production of wordlists, indices, or keywords in context),
term inspection tools, term bank maintenance and editing, printing facilities, and so on.
This allows for empirical and corpus-based terminology work. Providing good terminology is central to the term bank software.
Another possibility for checking terminology is to look up external term banks, like
EURODICAUTOM, TEAM, or others. This is possible with the TWB remote access
software. Users can access the EURODICAUTOM database and search for terms they
need to know. While this is technically feasible, it is time-consuming and expensive (in
terms of line costs); it would be simpler to download several modules of external term
banks like EURODICAUTOM into the local term bank (which, however, raises copyright problems).
In addition to the "official" terminology, released by a terminologist, and stored in a term
bank, it transpired that a more private device could be helpful where translators store their
particular information of any kind, ranging from private hints to phone numbers of
experts.
Therefore, we developed the Card Box in TWB, which is meant to support this kind of
information base and to allow for online access. It is implemented in the manner of a
hypercard stack which can be browsed back and forth.
All the tools presented above need to be improved to be fully operational: the term bank
functionalities must be better integrated, the editor - term bank connection must be stabilised, and the different term formats must be made exchangeable by creating terminology
interchange formats and software support. This will be an issue in the MULTILEX Esprit
project. The MATE functions must be made compatible, and a common user interface with
a consistent look and feel must be designed.
Other functions are missing in TWB, e.g. the possibility of looking up the words of a text
in a term bank and extracting relevant information, for instance to produce glossaries, lists
of "illegal" terms, synonym links, inconsistency checks, etc.
The TWB Translation Memory is a tool which looks up patterns in a database of previously translated text, and replaces the input patterns by their correspondences in the target language. The system consists of a training part which asks the user for correspondences;
these relations are interpreted in a statistical model. At runtime, this model is processed to
detect the target language patterns for a given input string.
Although it has only limited linguistic knowledge, this approach is promising in the area of
phraseological and terminological correspondences, i.e. where local decisions can be
made; and it should be trained for many small-scale text types rather than for one universal
text-type.
It will help the translators to translate fixed expressions of high frequency, and expressions
which have been translated before.
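A minimal sketch of this kind of lookup (not the statistically trained model used in TWB; a plain string-similarity score over a toy memory stands in for it here) might look like this:

```python
# Toy sketch of translation-memory lookup: find the most similar previously
# translated segment and return its stored translation. The memory content
# and the similarity threshold are invented for the example.
from difflib import SequenceMatcher

MEMORY = {
    "Check the oil level.": "Ölstand prüfen.",
    "Replace the air filter.": "Luftfilter austauschen.",
}

def suggest(segment: str, threshold: float = 0.8):
    best_score, best_pair = 0.0, None
    for source, target in MEMORY.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_pair = score, (source, target)
    if best_pair and best_score >= threshold:
        return {"match": best_pair[0], "translation": best_pair[1], "score": best_score}
    return None   # nothing similar enough; translate manually

print(suggest("Check the oil level"))
```

Training many small, text-type-specific memories rather than one universal memory, as suggested above, simply amounts to swapping in a different memory per text type.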
If texts are well written and repetitive enough, they can be completely translated by
machine. Machine (pre-)translation of technical text will have a large market share in the
future, given the constraints of the documentation process outlined above.
In order to experiment with this approach, TWB implemented the possibility of remote
access to an MT system, in this case METAL.
Users specify the text format and the lexicon modules to be used for translation, and send their text via an ODIF / X.400 connection to a translation server. It is translated there and the raw translation is sent back.
The success of such an approach depends on the lexicon maintenance, the quality of a
translation, and the ease of postprocessing.
Overall, the translation process can be organised in a very flexible way with the TWB
tools:
Users can translate "by hand" and look up terminology using the TWB lexicon lookup
tools;
Users can pre-translate frequently used patterns using the Translation Memory, and
translate only the rest manually;
Users can send the text to a translation server and post-edit the raw translation that is sent
back.
Again, improvements can be imagined in this area: a common user interface, a better
training facility for the memory, an access functionality beyond X.400, and additional tools like
sophisticated search-and-replace functions would support the translators even more. But
the general direction is clear: to react to the translator's needs in a flexible way,
with a set of supporting tools.
6. Market Trends for Text Processing Tools
Gerhard Heyer
Information systems buying in the 1990s is generally expected to polarize into growing and profitable so-called operational applications on the one
hand, and stagnating, so-called personal productivity and administrative applications on the other. Personal
productivity and administrative applications are primarily intended to reduce administrative costs, while operational applications are those that add new and better
services, improve operational flexibility, or reduce operational costs. The growth in operational applications is expected to dictate the structure of computing in the 1990s, where, in
particular, the operational activities will be organization-specific.
In addition to horizontal software applications, therefore, software development will also
have to focus on functional solutions, i.e. general solutions to problems that are common to
a number of vertical applications without becoming a product for the mass market.
Considering the main standard software packages for the PC (data base systems, integrated
software packages, word processing, DTP, spreadsheets, and graphics), text processing as
the key productivity tool is and will remain in the near future the single most important
horizontal software application.
However, in accordance with the general tendency towards more operational applications,
saturation of the word processing market is foreseeable, and is expected to have effects from
1994 onwards.
In 1991 the main trends for text processing software are:
Linguistic tools to enhance text processing packages on the PC today comprise standard
packages like:
spelling checkers,
proofreading tools,
thesauri and synonym dictionaries (monolingual dictionaries),
translation dictionaries,
translation support tools,
fully automatic translation,
remote access facilities to large automatic machine translation systems.
Definitions:
Spellcheckers check each word against a list, or dictionary, regardless of context, and
highlight only spellings which do not exist with respect to this list.
Proofreading tools make use of linguistic knowledge in terms of more complex dictionaries (like the Duden for German) or grammar rules, in order to identify orthographical errors
the correction of which requires knowledge of the context (e.g. correct use of articles, capitalization, lack of agreement between subject and verb).
In the market forecasts below, spellcheckers and proofreading tools are collectively
referred to as text editing tools.
Thesauri, synonym dictionaries, and term data bases list for each word its definition(s) and
possible alternatives.
Translation dictionaries list for each word or phrase its translation(s) in one or more other
languages. All electronic dictionaries can be used stand-alone or integrated into the text
processing system.
Translation support tools are editors and systems for interactively composing or correcting
translations.
Fully automatic translation systems are software systems that take some text as source text
and non-interactively translate it into the target text.
Remote access facilities to large automatic machine translation systems are software for
obtaining fully automatic translations via network services (e.g. via X.400).
7.1 The Use of Standards for Access from TWB to External Resources
Introduction
One of the aims of the TWB project is to be open to the outside world. For this reason,
some tools that interconnect TWB with external resources have been developed.
These tools include access to external term banks and access to machine translation systems, using X.400 as the communication means and ODA/ODIF as the document interchange format.
X.400 message handling systems can be used for many applications, apart from normal
InterPersonal messaging [CCITT 1988] [ISO 10021]. The Office Document Architecture
(ODA) and Interchange Format (ODIF) standard [ISO 8613] provides a powerful means
to help the transfer of documents independently of their original format. One of our user
applications combines the use of both standards, in order to provide remote access to
Machine Translation Systems (MT-Systems). In this way, documents inside X.400 messages are ODIF streams.
Furthermore, the ODA standard allows TWB users to incorporate into their environment
documents generated from different word processors. Converters have been developed
between several word processor formats, including the one used in TWB (FrameMaker),
and ODA, and between the METAL format (MDIF) and ODA. The second level ODA
Document Application Profile [Delgado/Perramon 1990] has been implemented (Q112/FOD-26 [CEN/CENELEC ENV 41510] [ISO DISP 11181]). Therefore, raster and geometric graphics, as well as characters, can be converted.
An important advantage of the developments made for remote access to MT-Systems is
that we can use the implemented software outside TWB: any X.400 system with ODA
capabilities can interchange documents with MT-Systems, and, on the other
hand, the ODA converters can be used to incorporate different word processor files into different word processing systems.
However, MT-Systems usually need more information, which has to be coded in the heading
of the X.400 message. Examples are:
Operation (Translate, Pre-Analyze, ... );
Language pair;
Thematic area of the document;
The solution we have adopted for sending this information from our system to the MT-System, via X.400, is to use the "Subject" field defined in the X.400 message heading. Hence,
operation, language pair, and thematic area are coded, straightforwardly, in the subject
attribute of the message. However, if the MT-System needs more information, another
solution should be adopted.
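By way of illustration, such an encoding of the parameters into the Subject field could look as follows (a sketch under invented conventions; the actual field syntax agreed between TWB and METAL is not reproduced here):

```python
# Hypothetical encoding/decoding of MT parameters in an X.400 "Subject" field.
# The separator and keyword names are invented for this sketch.

def encode_subject(operation: str, source: str, target: str, area: str) -> str:
    return f"MT;OP={operation};LP={source}-{target};AREA={area}"

def decode_subject(subject: str) -> dict:
    fields = dict(item.split("=", 1) for item in subject.split(";")[1:])
    source, target = fields["LP"].split("-")
    return {"operation": fields["OP"], "source": source,
            "target": target, "area": fields["AREA"]}

subject = encode_subject("Translate", "de", "en", "automotive")
print(subject)                  # MT;OP=Translate;LP=de-en;AREA=automotive
print(decode_subject(subject))
```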
Q112 (or the ISO equivalent FOD-26 [ISO DISP 11181]) allows for the interchange of
multi-media documents between advanced document processing systems in an integrated
office environment. FOD-26 documents may contain characters as well as raster graphics
and geometric graphics.
Q113 (or FOD-36 [ISO DISP 11182]) provides the features supported by Q112 and, in
addition, allows more complex logical and layout structures.
Although we initially chose the Q111 profile because it was adequate for our purposes
(remote access to an MT-System), we have finally developed Q112/FOD-26 converters in
order to take advantage of all the existing word processor facilities.
Communication between the TWB system and METAL is done through standard X.400 electronic mail.
For this reason, an X.400 system (based on the results of the CACTUS ESPRIT project
[Saras 1988] [Delgado 1988]) has been integrated into both the TWB and METAL systems. A
special X.400 user agent that interacts with the MT-System has been developed and interfaced with METAL (through "X-Metal"). The role of this user agent, installed on the MT-System side, is to receive X.400 messages from outside, and to generate replies back with
the translated document.
The use of ODA/ODIF to send documents to MT-Systems guarantees the standardisation
of the input and output formats, and allows a translation to be returned with the same
structure and format as the source text.
The use of X.400 to access the MT-System provides a widely available standard means to
access the service, avoiding the need to define a new access mechanism. Documents (in
ODA format) are sent inside X.400 messages.
Once a message has arrived in the MT-System, the following steps are taken:
The ODA document is extracted from the X.400 message body;
The content of the document is translated to the required human language;
One reply message is generated back with the translated document (converted to ODA
format) inside its body.
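Schematically, and with every component reduced to a placeholder (no real X.400, ODA or METAL calls are made in this sketch), the server-side sequence can be pictured as follows:

```python
# Schematic, self-contained sketch of the three server-side steps; every
# function here is a stand-in (no real X.400, ODA, or METAL software is used).

def extract_oda_body(message: dict) -> str:
    return message["body"]                       # step 1: take the ODA document out

def translate(document: str, language_pair: str, area: str) -> str:
    # Placeholder for the call into the MT system (METAL in the project).
    return f"[{language_pair}/{area} translation of] {document}"

def build_reply(original: dict, translated_body: str) -> dict:
    return {"to": original["sender"], "subject": original["subject"],
            "body": translated_body}             # step 3: reply with the translated ODA

incoming = {"sender": "translator@example.org",
            "subject": "MT;OP=Translate;LP=de-en;AREA=automotive",
            "body": "<ODA document>"}
reply = build_reply(incoming,
                    translate(extract_oda_body(incoming), "de-en", "automotive"))
print(reply)
```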
7.2.1 MHS Environment
We have already stated that the X.400 Message Handling System we are using is based on
CACTUS, which provides P7 access [CCITT 1988] [ISO 10021] to their users.
A CACTUS user is able to perform a set of distributed operations, acting as a client of a CACTUS mailbox server (MBS) by means of a mailbox client (MBC). The communication between MBC and MBS is made through a logical or physical connection (i.e. from the
same machine, or via direct wire, modem, etc.).
A mailbox server can support a cluster of different users, each one identified by a mailbox
name (the address of a user). But one of the more interesting features of CACTUS is that it
allows the server to run special tasks automatically when a message for a particular mailbox is received. These special tasks, called task-mailboxes, are in fact processes that handle messages in the way the designer of such processes dictates.
Therefore, we have an adequate environment to send X.400 messages, using CACTUS,
to any machine around the world. We can also activate suitable tasks on the receiving
side of these messages in order to route them to the machine translation system and generate replies with the translated document.
7.2.2 X-Metal
Fundamentals
The current TWB implementation of the remote access to the METAL machine translation
system using standard X.400 MHS networks has been made using the CACTUS
package, developed by the UPC, as the underlying MHS software, both on the TWB
workstation side and on the METAL side. The module X-Metal interconnects the MHS
system with the METAL system on the METAL side.
Error Handling
X-Metal reacts to a number of possible errors which may arise during processing. These
errors can be of different types: wrong translation parameters set by the user (e.g. the user
specified a language pair which is not supported by the local METAL system), wrong
incoming document format (the document is not in ODIF format), malfunction of the converters, malfunction of the METAL system, and so on. In most cases, X-Metal reacts by
sending a message to the user containing information about the particular error encountered. If possible, X-Metal will retry the operation later (for instance, in those cases where
either METAL or the MHS outbox are not accepting documents for some reason).
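The behaviour just described can be caricatured as follows (a sketch only; the error categories are those listed above, but the retry policy and all names are invented):

```python
# Sketch of the reported error categories and a simple retry decision.
# Which errors are retried, and how often, is invented for the example.

RETRYABLE = {"mt_system_unavailable", "mhs_outbox_unavailable"}
FATAL = {"unsupported_language_pair", "document_not_odif", "converter_failure"}

def handle_error(error: str, notify, retry, attempts_left: int = 3) -> None:
    notify(f"X-Metal error: {error}")            # always inform the user
    if error in RETRYABLE and attempts_left > 0:
        retry(attempts_left - 1)                 # try the operation again later
    # errors in FATAL are only reported, never retried

handle_error("mt_system_unavailable",
             notify=print,
             retry=lambda n: print(f"retrying later ({n} attempts left)"))
```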
Portability
Although the current implementation of X-Metal is running on the UPC CACTUS package, it has been designed in such a way that it can be easily ported on top of any MHS
package which has an API (Application Programming Interface) similar to the P7 protocol.
specifying an interface between them. Apart from these modules, we need converters
between ODIF and the stored ODA format.
The analysers scan the input document (in SODA or word processor format) and generate
a series of function calls to the corresponding generator, which creates the output document structure. The SODA analyser and SODA generator modules are unaware of any
word processor document structure, and the word processor modules are likewise unaware of the ODA
structure.
The function calls constituting the interface between the analyser and generator behave as
a sort of "intermediate document format". We call this interface ODAPSI ("ODA Profile
Specific Interface"). The ordered sequence of function calls generated by the analyser can
be regarded as a (sequential) document description, in the same way as a word processor
sequence of formatting commands and text contents, or as an ODA specific structure.
The modules are interchangeable (provided that they generate/accept the same function
calls or "intermediate format"). For example, let WP1 and WP2 be two word processor
document formats. A WP1 <--> SODA converter can be turned into a WP2 <--> SODA
converter by simply replacing the WP1 analyser/generator with a WP2 analyser/generator.
Also, the SODA module could be replaced, resulting in a direct WP1 <--> WP2 converter.
7.3.3 The Intermediate Format (Analyser-Generator Interface): ODAPSI
The ODAPSI format describes a document (its logical structure) in terms of the commonly used word processor components, e.g. section, paragraph, style, and so on (these
terms are also used by the DAPs). In order to simplify the interface, this description is
sequential, and approximately follows the order in which a human user would describe a
document from a keyboard in front of a word processor (this order may be different from
that of the word processor document's internal structure).
The typical sequence of calls generated by the SODA analyser for every object found in
the specific logical structure is:
create logical object;
layout style (actually, this is an argument to the 'create' function)
for composite logical objects:
- recursive calls for each of the logical subordinate objects
for basic objects:
- presentation style attributes
- content information;
close logical object.
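One way to picture the analyser-generator interface is the following sketch (the class and method names are invented for illustration and are not the actual ODAPSI calls):

```python
# Sketch of an analyser driving a generator through an ODAPSI-like interface.
# The method names and the toy document structure are invented for illustration.

class Generator:
    """Receives the sequential 'intermediate format' calls and builds output."""
    def __init__(self):
        self.output = []
    def create_logical_object(self, kind, layout_style=None):
        self.output.append(f"<{kind} layout={layout_style}>")
    def presentation_style(self, **attributes):
        self.output.append(f"  style {attributes}")
    def content(self, text):
        self.output.append(f"  text: {text}")
    def close_logical_object(self, kind):
        self.output.append(f"</{kind}>")

def analyse(document, gen: Generator):
    """Walk the (toy) specific logical structure and emit the call sequence."""
    for paragraph in document:
        gen.create_logical_object("paragraph", layout_style=paragraph["style"])
        gen.presentation_style(font=paragraph.get("font", "default"))
        gen.content(paragraph["text"])
        gen.close_logical_object("paragraph")

gen = Generator()
analyse([{"style": "body", "font": "Times", "text": "Ein Beispielabsatz."}], gen)
print("\n".join(gen.output))
```

Swapping either side of the interface (a different analyser, or a generator for another format) leaves the other side untouched, which is exactly the interchangeability argued for above.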
7.3.4 Results
A very flexible scheme to build word processor <--> ODA converters has been described:
clear internal formats, internal modules and internal interfaces provide a sound basis for the
development of converters.
For our initial purposes, the Q111 Document Application Profile was adequate. Experience was obtained in the development of Q111 converters for common word processors,
like WordStar, Microsoft Word, WordPerfect or troff.
However, we finally developed software based on Q112/FOD-26 in order to provide
more general purpose converters. Work continued to develop Q112 converters for WordPerfect and FrameMaker, the internal word processor of TWB.
Content Extraction
In this step the SODA document is scanned for text associated with basic objects. This text
is extracted and a temporary file is created with it. At the same time, control characters
inserted throughout the text and having their 8th bit set to one are detected and converted
into mnemonic string sequences. This is necessary in order for the LEX interpreter in the
next step to operate properly. Finally, the MDIF file header containing some required
parameters is created.
Since Q112 ODA documents may contain graphic content portions, we have to filter these out
in order to exclude them from the MDIF file. However, the possibility exists of handling
graphics described in CGM in order to extract any text they contain and to insert this
text into the MDIF file. This is open to further study at the moment.
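A rough sketch of the extraction step described above (illustrative only; the mapping from high-bit control characters to mnemonic strings is invented and is not the real MDIF convention):

```python
# Sketch of extracting text content and rewriting high-bit control characters
# as mnemonic strings. The control-code table is invented for the example.

CONTROL_MNEMONICS = {0x82: "<TAB>", 0x85: "<NEWLINE>", 0x91: "<SOFT-HYPHEN>"}

def extract_content(raw: bytes) -> str:
    pieces = []
    for byte in raw:
        if byte >= 0x80:                                  # 8th bit set: control code
            pieces.append(CONTROL_MNEMONICS.get(byte, f"<CTRL-{byte:02X}>"))
        else:
            pieces.append(chr(byte))
    return "".join(pieces)

print(extract_content(b"cylinder\x82head\x85"))
# cylinder<TAB>head<NEWLINE>
```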
Table Handling
In this step, the temporary file is scanned for a tabulator character pattern which indicates
the presence of tables in the document. Should a table be detected, the temporary file is
given explicit information on the table in the form of special string sequences. These
string sequences indicate where the table starts, where it ends and its column pattern.
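Schematically, the table detection can be pictured as follows (a sketch with an invented tab-pattern heuristic and invented marker strings, not the actual MDIF markup):

```python
# Sketch of detecting tab-separated lines and wrapping them in table markers.
# The marker strings and the "two or more tabs per line" heuristic are invented.

def mark_tables(lines):
    output, in_table = [], False
    for line in lines:
        is_row = line.count("\t") >= 2            # heuristic: looks like a table row
        if is_row and not in_table:
            columns = line.count("\t") + 1
            output.append(f"<TABLE columns={columns}>")
            in_table = True
        if not is_row and in_table:
            output.append("</TABLE>")
            in_table = False
        output.append(line)
    if in_table:
        output.append("</TABLE>")
    return output

sample = ["Engine data:", "rpm\ttorque\tpower", "2000\t310\t96", "End of table."]
print("\n".join(mark_tables(sample)))
```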
The communication is done through the X.25 public data network with several options
available for the installation of the network connection in different hardware environments. For example, we can access the X.25 network from any computer (acting as a client and running TWB) in a local area network, where we have a computer (acting as a
server) with the physical X.25 connection. When Sun machines are used, SunLink X.25 software is required to interface to X.25.
Although the current version runs on UNIX, plans are underway to port the EURODICAUTOM remote access software to the PC, with and without the need for an X.25 connection.
It does not contain grammatical information about terms (syntactic category, gender,
tense, and so on);
Depending on a) the language a term pertains to, and b) its subject field, EURODICAUTOM provides a different amount of information, and a different degree of information reliability.
Level of information required: basic (only the main answer) or complete (all possible
answers);
Use of subcodes.
Type (or Group; indicates the symposia, meetings or journals where the question was discussed);
Originating office (indicates the office or entity which has gathered the terminological
information);
Reference (source of the terminological information);
Reliability rating (from 0 to 5, where 0 = no source and 5 = from a standard);
Country (term indicating the country to which the entity in question belongs).
II) Linguistic information:
8.
When dealing with tools for linguistic text processing, the language resources, i.e. dictionaries, are necessarily a central issue. In this chapter we will present the idea of reusable lexical resources as it has been proposed and is being carried out in the ESPRIT project MULTILEX. In the second part we will give more information on a dictionary for the special purpose language of law, commerce, and finance, as it is implemented in the present PC Translator's Workbench.
(2) exploiting already existing lexical resources for different applications and different
theories, typically exploiting one lexical database in different applications.
Arguing from a software engineering point of view, we shall present and discuss in the following the idea of compiling application specific lexica on the basis of a standardised lexical database as presently elaborated in ESPRIT II project MultiLex and applied in the
project Translator's Workbench. In contrast to work focussing on either one of the two
aspects of the notion of reusability, the compilation approach is intended to integrate both
aspects, and to optimally support the design of natural language products.
The intuitive idea of the compilation approach is to construct highly efficient and holistically designed natural language applications on the basis of linguistic knowledge bases that contain basic and uncontroversial linguistic data on dictionary entries, grammar rules, and meaning definitions, independent of specific applications, data structures, formalisations, and theories. Since linguistics is a more than two-thousand-year-old science, there is ample theoretical and empirical material of the required kind available in the form of written texts and studies on the source level of linguistic knowledge, or it can be (and is) produced by competent linguists. However, very little of this knowledge is available on electronic media. Thus, the very first task of software engineering for language products, as Helmut Schnelle has recently put it (Schnelle 1991), must be the transformation of available linguistic data from passive media into active electronic media, here called lexica, grammars, and definitions on the linguistic knowledge base level. In terms of implementation, such media will mainly be relational, object-oriented, or hypermedia databases capable of managing very large amounts of data. Clearly, in order to be successful, any such transformation also requires formalisms to be used on the side of the linguists for adequately encoding linguistic data, the linguistic structures assigned to them, and the theories employed for deriving these structures. Moreover, within linguistics, we need to arrive at standards for each such level of formalisation. In reality, therefore, the first task of software engineering for language products is quite a substantial one that can only succeed if the goal of making linguistic knowledge available is allowed to have an impact on ongoing research in linguistics by focussing research on formalisms and standards that can efficiently be processed on a computer (see also Boguraev and Briscoe 1989).
[Figure: The three levels of lexical resources - source level; linguistic knowledge base level (lexicons, grammars, definitions); application level - with the lexical database organised by level of description (pragmatics, semantics, syntax, morphology, orthography, phonetics), the proposed representation standards (TFL, ISO, CPA), and SGML conversion as the exchange route.]
On the source level, there are printed dictionaries, text corpora, linguistic intuitions, and a few lexical databases. In order to make these sources available for language products, we first need to transform the available lexical data according to an exchange standard into a representation standard on the level of a lexical database (for each European language). The exchange standard proposed by MULTILEX is SGML, following recommendations from the ET-7 study on reusable lexical resources (Heid 1991). The representation standards proposed for the different lexical levels are the Computer Phonetic Alphabet (CPA) for the phonetic level, the ISO orthographic standard for the orthographic level, and a typed feature logic for the morphological, syntactic, and semantic levels. In this functional view, implementation details of the database are irrelevant as long as it allows for SGML communication.
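To make the functional view concrete, the sketch below holds one invented entry with a slot per linguistic level and serialises it into an SGML-like exchange string; the element names are illustrative only and do not reproduce the MULTILEX DTD:

# One (invented) entry, one slot per linguistic level, serialised to an
# SGML-like exchange string.  Element names are illustrative only.
entry = {
    "orthography": "Haus",                   # ISO orthographic level
    "phonetics":   "haUs",                   # CPA-style transcription (assumed)
    "morphology":  {"cat": "noun", "gender": "neut", "plural": "Häuser"},
    "syntax":      {"subcat": "common-noun"},
    "semantics":   {"sense": "building for human habitation"},
}

def to_sgml(e):
    feats = lambda d: " ".join(f'{k}="{v}"' for k, v in d.items())
    return (f'<ENTRY><ORTH>{e["orthography"]}</ORTH>'
            f'<PHON notation="CPA">{e["phonetics"]}</PHON>'
            f'<MORPH {feats(e["morphology"])}/>'
            f'<SYN {feats(e["syntax"])}/>'
            f'<SEM {feats(e["semantics"])}/></ENTRY>')

print(to_sgml(entry))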
From a software engineering point of view, when dealing with large amounts of lexical data, a number of problems arise that are similar to choosing an appropriate implementation representation in database systems. The key issue for large lexica here is data integrity. Maintenance operations like updating, inserting, or deleting have to give the user a consistent view of the lexical database. All problems that arise in database management systems at this level also arise in a lexicon. We therefore suggest the use of a database system as the basis of our lexicon in order to save work that would otherwise have to be done at the level of the lexical database. The same point also holds for distributed lexical databases, e.g. for lexica that are spread over and maintained in different countries.
Since most existing natural language applications use lexica that have been defined and
implemented only to fulfill the application requirements (application specific lexica), the
reusability of such lexica is a problem once one wants to use the same lexica in different
contexts and applications. With respect to the maintainability of such software systems,
we claim that real reusability can only be assured if it is based on a standard representing a
general lexical representation. The MULTILEX representation based on the notion of a
typed feature logic can be considered such a standard.
In many applications it may be possible to use the MULTILEX format without any change. Typically, such applications are not subject to narrow time and space constraints. If a system has access to a host with a large amount of secondary storage (WORM drives, huge disks) and no time-critical operations are required, it can use the functions of MULTILEX without modification. Batch systems (e.g. machine translation in a background process) may interact in such a way.
On the other hand, a number of applications are time critical, typically all systems directly interacting with the user, e.g. spelling checking or handwriting recognition. Additionally, such systems have limited space (e.g. on a PC). For such systems it is therefore necessary to provide compilers that select the necessary information from the MULTILEX lexicon and compile this information into a special data structure which supports the operations needed by the application in an optimised way. In general, applications that use main memory or memory cards as their lexicon storage medium need representation formalisms (e.g. AVL trees) other than those of hard disk based systems (which may use binary trees).
Finally, the compilers built for producing application specific lexica not only support an
optimised data structure, but also support additional operations like selecting a specific
subset of the lexicon entries (one can think of SELECT in relational terms), and selecting
a subset of variables of a lexical entry (PROJECT in relational terms). Such operations,
for example, may be the selection of verbs or nouns with specific characteristics.
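A minimal sketch of such a compilation step, with SELECT and PROJECT realised over a toy lexicon (the field names and the final 'optimised' structure are merely illustrative), might look like this:

# SELECT picks a subset of entries, PROJECT keeps a subset of attributes, and
# the result is packed into an application-specific structure (a sorted list
# here stands in for an AVL tree or other optimised structure).
LEXICON = [
    {"lemma": "amend",  "cat": "verb", "freq": 120},
    {"lemma": "ratio",  "cat": "noun", "freq": 310},
    {"lemma": "revise", "cat": "verb", "freq": 95},
]

def select(entries, predicate):                  # SELECT in relational terms
    return [e for e in entries if predicate(e)]

def project(entries, attributes):                # PROJECT in relational terms
    return [{a: e[a] for a in attributes} for e in entries]

# compile a spelling-checker lexicon: verbs only, orthography only, sorted
compiled = sorted(project(select(LEXICON, lambda e: e["cat"] == "verb"),
                          ["lemma"]), key=lambda e: e["lemma"])
print(compiled)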
The approach sketched above is presently being used successfully to develop specific lexicon-based language products such as multilingual electronic dictionary applications for human and machine users in the area of automatic and semi-automatic translation support, highly compressed multilingual spelling correctors and language checkers, and highly compressed lexica for optimising handwriting recognition, as can be seen from the following sections.
Table 2: Example entry with more than one meaning; different meanings are given
in square brackets "[ ... ]" (from Herbst & Readett, 1989; page 1)
abändern @v (A) I to alter; to modify; to change # changer; modifier I eine Erklärung - I to modify a statement # modifier une déclaration
abändern @v (B) [ergänzend -] I eine Entscheidung - I to revise a decision # revenir sur une décision I ein Gesetz - I to amend (to revise) a law # amender un projet de loi I einen Gesetzesentwurf - I to amend a bill # amender un projet de loi I einen Plan - I to amend (to modify) a plan # apporter une (des) modification(s) à un projet
abändern @v (C) [berichtigen] I to rectify; to correct # rectifier; corriger
abändern @v (A) #to alter; to modify; to change #changer; modifier #eine Erklärung abändern# to modify a statement #modifier une déclaration
abändern @v (B) [ergänzend abändern] I eine Entscheidung abändern#to revise a decision#revenir sur une décision#ein Gesetz abändern#to amend (to revise) a law#amender (modifier) une loi#einen Gesetzesentwurf abändern#to amend a
Each volume contains about 100,000 terms which are organised in about 40,000 entries. Each entry contains a main term (source language, typed in bold) with its translation equivalents plus terms which are related to the main term (see Table 1). Terms also contain a description of the word category (e.g. @adj for adjective, @v for verb and so on). The following examples are taken from the converted printer tapes. They differ from their equivalent printed entries in so far as some changes have been made to the printing format (no bold printing, italic printing replaced by the character '@').
Terms which have different meanings are separated as entries and numbered (see Table 2). If more than one meaning exists, a description of the special meaning may follow.
typesetting errors: forgetting the separating marks between languages; cutting off entries where continuation text was clearly necessary; insertion of meaningless characters in entries and similar errors. Entries used a special abbreviating code, and this code was often used wrongly or in an unpredictable form. Some of these errors could be recognised by the parser in a purely syntactic way (e.g. missing French translation equivalents; intermingling of French and English translations), as sketched in the example below;
linguistic errors: only a few could be found using a parser, which could only check for syntactic errors as mentioned above. These errors were found by manually inspecting the entries and lists when the index files for the Windows applications were generated. As an example, "ss" and "ß" were not used correctly.
The process of eliminating these errors was quite costly and had to be repeated several times (in many cases one error hid another error, which could only be detected with the next parser run). Human readers are in most cases able to correct these errors through their knowledge of the language (e.g. if the separation between languages is missing, in nearly all cases it is obvious to a human reader where to place the separating mark).
The user is working with his or her text processing system, e.g. translating a letter or writing some foreign language text, and wants to get the translation equivalent for a term. The user can mark the term to be translated and then activate the HFR-Dictionary from within the application (e.g. Word for Windows; a macro for this text processing system has been implemented). This can be achieved in two ways: a) by using the standard copy and paste facility of MS Windows and b) by using the DDE approach. The latter requires some kind of macro language within the text processing system and the support of basic DDE functionality. For more details please see the chapter on integration of the TWB modules.
When starting the HFR-Dictionary the user gets one source language menu and one to three target language windows. The source language menu contains all the entries for a specific single letter (like a printed dictionary; e.g. all entries for the letter "A"). After clicking on the desired entry, the translation equivalents are displayed in the target language windows (e.g. one window for English equivalents and one for French equivalents). In most cases the user will only use one target language window. Users can save their preferred settings; when starting the HFR-Dictionary again they will have the same user interface environment as the last time the application was used.
Additionally, the user may enter terms in an input window and get the translation equivalents this way. He or she may also copy the contents of the clipboard into the input window. As an additional feature, the user can choose to adapt the source language menu to the current input string of the input window (e.g. when entering "qual" the source language menu positions itself at the words starting with the string "qual", like "qualitativ", "Qualität", "Qualitätskontrolle" ...). As this is sometimes quite time consuming, the user can switch off this feature.
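The menu-positioning behaviour can be sketched as a prefix search over the sorted word list (illustrative only; German collation of umlauts is ignored here):

# Position a sorted source-language word list at the first entry starting
# with the typed prefix.
import bisect

entries = sorted(["Qual", "qualitativ", "Qualität", "Qualitätskontrolle", "quer"],
                 key=str.lower)

def position(prefix: str) -> int:
    keys = [e.lower() for e in entries]
    return bisect.bisect_left(keys, prefix.lower())

print(entries[position("qual"):])   # the menu now starts at the words beginning with "qual"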
When the user is satisfied with the translation equivalent, he or she can mark the appropriate parts in the target language window and copy them to the clipboard or transfer them with a special button to the calling application.
As has been described above, within the ESPRIT project 5304 MULTILEX a standard representation for lexical resources, both multilingual and multifunctional, is under development. The HFR-Dictionary will be converted into this format once the standard is defined. This implies that the HFR-Dictionary data will also be stored in a relational database (using ORACLE).
8.2.6 Conclusions
The conversion from printer tapes into a computational lexicon is not as easy as it may first look. Having only the printer sources is thus not enough; one also has to invest a considerable amount of time in both syntactically and manually improving the sources. However, once the dictionary is in an appropriate form, various types of applications can be derived from it.
9.
The Commission of the European Communities has recently estimated that 170 million
pages of text are translated per year in Europe alone and that this figure will increase to
600 million pages by the year 2000. Despite much useful research and development in the
field of machine translation, the fact remains that much of this work is still carried out by
human translators, with or without such valuable aids as terminological databases.
Most professional human translators have studied a foreign language and are therefore
familiar with language for general purposes. The problem translators have to overcome
lies in dealing with language for special purposes. As even experts do not always know all
the details of their subject area, one can easily imagine the enormous difficulties translators who in general are not subject experts encounter when translating manuals, reports,
announcements, and letters at all levels of detail in several subject fields. Therefore they
need support in the special language terminology.
For looking up special language terms, translators use printed resources like lexica, thesauri, encyclopaedias and glossaries. Recently, they have also begun to use the electronic medium. Nowadays, almost all translators make use of computers, especially of word-processing software, which supports them in creating, writing, correcting and printing translated texts. In addition to word processors, tools which support terminology work are of increasing interest. These tools comprise computerised lexica, terminological databases and private computer card files.
[Fig. 9: Conceptual model of the TWB termbank - entities include Entry, Equivalent, Comment, Sense Relation, Source, Domain, Image, Elaboration, Encyclopaedic Header and Encyclopaedic Unit.]
Terminological databases, or, in short, termbanks, are meant to support translators and experts in their daily work. They contain terminological data on (several) subject fields. Apart from the terms as such, a termbank entry often contains additional information such as definitions, contexts and usage examples, as well as relations between the entries such as 'is-translation-of', 'is-synonym-to', 'is-broader-than', etc. In order to access the stored terminological data, a user interface has to be provided. The user interface has the task of offering the user an easy way to retrieve or modify terminology. Thirty years ago, the first termbanks were introduced in organisations where terminological support is especially needed, like:
Based on this information, the SU and IAO teams have jointly developed the prototypical termbank entry and the conceptual, logical and physical structure of the termbank. SU and HD have elaborated terminology for the subject fields "catalytic converters", "anti-lock braking systems", and "four-wheel drive", and SU entered the data (Ahmad et al., 1990). The IAO has designed and implemented a retrieval interface (Mayer 1990), which was tested and evaluated by translators at Mercedes-Benz AG (Hoge et al., 1991).
as "is-elaborated" between entry and elaboration. Since the tennbank group decided that
all entries have the same status, the translation equivalent of an entry, for example, is
stored as a relationship between the corresponding language entries.
The entity elaboration contains textual infonnation such as definitions, context examples,
collocations and usage infonnation. The encyclopaedic units are also textual infonnation,
but of a special kind. The encyclopaedia was included because translators often need more
infonnation on technical terms than a definition can provide. An encyclopeadic explanation gives translators, who normally are not subject experts, insight into the technical
background. Several terms can be linked to a single encyclopaedic unit, which contains
infonnation about linked tenns. The encyclopaedic units are not isolated but grouped and
linked. Every unit has a unique heading, which can be accessed via the tenn. Encyclopaedic units often comprise headers of other encyclopaedic units. The units and the headers
form a group; all headers form a non-hierarchical network (see Fig. 10). The user can
browse through the network, following the header structure up and down.
Each entity has several attributes. Because the termbank entry is the central entity, its attributes are listed here:
entry, i.e. a word, a group of words, a phrase;
short grammar, e.g. indicating gender, part of speech;
language, given in short form: en, es, de, for English, Spanish, German;
country: US, UK, DE, ES for the United States, United Kingdom, Germany, Spain;
status: r, a, g for red, amber, green, standing for term validation (i.e. red = not validated, green = fully validated);
termstatus: e.g. pre, sta, int for preferred, standardized, internal.
Beside these attributes, the date of insertion or last modification, as well as the name of the responsible terminologist, is recorded.
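For illustration, the central entity with the attributes listed above can be written down as a simple record structure (the value codes follow the text; the actual database schema is not reproduced here):

# The central termbank entity with the attributes listed above.
from dataclasses import dataclass

@dataclass
class TermEntry:
    entry: str            # word, group of words, or phrase
    grammar: str          # short grammar, e.g. gender, part of speech
    language: str         # en, es, de
    country: str          # US, UK, DE, ES
    status: str           # r / a / g (red = not validated, green = fully validated)
    termstatus: str       # pre / sta / int (preferred, standardized, internal)
    date: str             # date of insertion or last modification
    terminologist: str    # responsible terminologist

print(TermEntry("catalytic converter", "n", "en", "UK",
                "g", "pre", "1990-09-11", "kbw"))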
Having started the termbank by selecting the termbank button in the toolbox, the user first of all has to define the information categories he/she wants to be displayed on the screen. This is done by means of the specification window (see Fig. 11).
[Fig. 11: The specification window]
The user can save individual specification profiles. After having typed in a search term, the respective information is offered in the retrieval window. Additional information can be accessed using the pull-down menu 'further info'.
Term access is supported by including spelling variations and wildcard search (see Fig. 12). A list of possible matches is then given, and the user chooses one of the terms offered by a double-click.
[Fig. 12: Termbank retrieval with wildcard]
Because most translators usually read the whole text to be translated and mark the unknown terms before starting the actual translation, we provided the list search tool. The user can cut and paste all unknown terms into the list search window and obtain the retrieved information in a new results window (see Fig. 13). He/she can also print the results, if a hard copy is more appropriate, or save them in a file for later sessions. In our opinion, it is important to extend the functionality of the interface in places where a kind of browsing facility is possible. A graphical overview of information in connection with a requested term should be implemented.
Browsing is an important information-seeking strategy for users - especially for novice and casual users.
Marchionini (1988) states three reasons why people browse:
They browse because they cannot define their search objective explicitly. They often
proceed iteratively, beginning with a broad entry, browsing for related entries, and
looking for additional entry points.
well. Selecting one of these headers brings up the corresponding text in the text area. The user can also list all headers in alphabetical order and obtain a history list of all headers of all units he/she has accessed in the session.
information relevant to the work at hand. The specific arrangement of the cards, then, is crucial, because it usually has a unique order, tailored to the job and obvious only to the creator of the system. Other translators have several hundred cards organised in a single batch and ordered alphabetically. Often English and German term definitions are mixed with names of companies, experts, abbreviations, etc. The cards cover a wide range of information, from straight monolingual definitions, bibliographic hints, short graphics, mathematical symbols, and idiomatic expressions to translation equivalents.
Therefore, the Cardbox, which functions as a private termbank, was planned to cater for individual working strategies and styles. The Cardbox allows the translator to define individual cards online. The advantage of this is that the translator does not have to change the working medium, since the termbank and the Cardbox (as a kind of private termbank) can be accessed simultaneously.
Card Stacks: Definition of any number of card stack templates, where the number and name of attributes may be chosen freely. Creation, addition, and deletion of templates or of any attributes of a defined template are possible.
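A rough sketch of such a user-defined card stack, including the attribute-field selection described below, might look as follows (illustrative only):

# A user-defined card stack: the template fixes the number and names of the
# attributes, individual cards fill them in; a search string entered in any
# attribute field selects the matching cards.
class CardStack:
    def __init__(self, name, attributes):
        self.name, self.attributes, self.cards = name, list(attributes), []

    def add_card(self, **values):
        self.cards.append({a: values.get(a, "") for a in self.attributes})

    def select(self, pattern):
        return [c for c in self.cards
                if any(pattern.lower() in str(v).lower() for v in c.values())]

stack = CardStack("TEST", ["DEUTSCH", "ENGLISCH", "Comment"])
stack.add_card(DEUTSCH="Abgas", ENGLISCH="exhaust gas")
print(stack.select("exhaust"))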
in any of the attribute fields causes the respective cards to be displayed. Wildcards can also be used. The number of selected cards is displayed. The user can browse through the selected cards using next and previous.
How can we make the interface of a termbank more attractive? In the area of user interface
development, additional multi-media tools like video, animation, sound, pictures etc. are
being investigated on a large scale. Interpreters or learners, for example, may be interested
in the correct pronunciation of a term. Experts could be supported by an animation component which explains the functions of a machine.
Another addition may be the full integration of cardbox and termbank. Users should have
the possibility of copying information from the termbank onto their cards. Another useful
addition to the cardbox would be a graphical component, and the possibility of having
more than one card file open and visible.
10.1 Background
A special language, or LSP, is the language of experts in a narrowly defined domain. It is a specialised, monofunctional, subject-specific language in which words or "terms" are used in a way peculiar to that domain lexically and semantically, and also, in some cases, morphologically and syntactically. The study of LSP is a well-established (academic) discipline complete with its repertoire of academic departments, journals, books, and so on.
LSP terms - usually nouns - are used to encode different aspects of the knowledge of a
domain including abstract concepts, sense-relations, nomenclature, and process-oriented
or device-related descriptions.
A terminology is a collection of such LSP terms within a particular domain, in which the
terms are defined and the interrelationships between them explained.
The knowledge of a specialised domain evolves according to a life-cycle paradigm, i.e.
inception, currency, refinement and obsolescence of knowledge; this is reflected in the
evolution of the terminology of the domain. Terms are used by experts and technical writers for communicating the knowledge of the domain to other experts, novices and in some
cases laypersons.
Generally, text is used as the medium for this communication and can be classified into a
range of text types as described later. Given the aim of closer cooperation between linguistically diverse nation states, such as those within the EC, and the establishment of a
multi-national corporate culture, specialised knowledge has to be transferred across communication barriers. This transfer takes place through the medium of language, and notably, text. Note that, as the nature of an LSP may be defined at a number of linguistic
levels, the terminology of a specialised domain and the use of terms in a given language
will reflect the lexical organisation of the language, its morphology and its syntax.
banks - the engineering of term banks with clearly defined phases of specification, design,
implementation and testing of terms and their potential utility. We stipulate that the model
should have synergy and reflect the fact that terms are used for encoding (communicating)
and decoding (interpreting) the knowledge of specialised domains. This communication,
primarily text-based, manifests itself in a range of text types (e.g. informative, instructional, persuasive) in different languages for a varied audience comprising experts, novices, students, laypersons, etc.
Within our model, the development of term banks - the organised collection of terms of a
domain for identifiable target users - is simulated as a continual process of term elicitation
and elaboration. This process involves the execution of four consecutive phases (see also
Fig. 19):
Phase 1: acquisition - the conceptual organisation of the domain and the creation of a corpus; the identification of specialist terms in texts, glossaries or dictionaries.
Phase 2: representation - the linguistic description of the term (words/phrases), including identification of grammatical category, category-specific morphological and syntactic information, and linguistic variants.
Phase 3: explication - the elaboration of the term including its descriptive definition
and descriptive contextual use.
Phase 4: deployment - the exploration of the sense relations between the term and
other (pre-stored) terms, including the establishment of cross-linguistic equivalence.
The successful execution of each phase of term elicitation is delineated with clearly identifiable data: e.g. the acquisition phase data include an overview of the domain or subdomain and a list of terms together with archival data. The representation phase follows
the acquisition phase and is deemed complete for a term (or a list of terms) with the provision of specified linguistic data. The explication phase involves the procurement of definitions, generally from domain experts, and of examples of contextual use from a carefully
selected corpus of texts. In the deployment phase, both experts and the corpus can be used
to identify and quantify sense-relations such as synonymy and hyponymy.
The term data can be refined by repeatedly executing one or all of the phases mentioned above. As each phase generates its own data, it becomes clear that the data associated with each term can be categorised as acquisitional, representational, etc. These data require different data structures to be effectively stored in (or retrieved from) a computer system.
different data structures to be effectively stored in (or retrieved from) a computer system.
Once stored, the requirements of maintaining and updating these data will be different.
The retrieval of the different data items will depend on the given user - e.g. the TWB User
Requirements Study (Chap. 4) identified that a translator will generally require deployment data (foreign language equivalent, etc.), representational data (grammatical information, etc.) and explicatory data (context, etc.).
[Fig. 19: Life cycle model of term elicitation (Ahmad and Rogers, forthcoming) - the terminologist moves through ACQUISITION (identify and collect specialist documents, lexica and terms), REPRESENTATION, EXPLICATION (establish definitions and contextual use of terms) and DEPLOYMENT.]
The different data structures required to encode the data from each of these phases are as follows: acquisitional data need 'simple' data structures (records and lists); representational data need data structures with predicative power; explicatory data require network data structures; and deployment data require network and list structures.
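The sketch below illustrates, with invented field names, the contrast between a 'simple' record/list structure for acquisitional data and a network structure for the sense relations of the deployment phase:

# 'Simple' record/list structures for acquisitional data; a network
# (adjacency mapping) for the sense relations of the deployment phase.
acquisition = [                                   # records and lists
    {"term": "monolith catalyst", "source": "enGB00481", "terminologist": "kbw"},
    {"term": "catalytic converter", "source": "enGB00487", "terminologist": "kbw"},
]

deployment_network = {                            # network structure
    "catalytic converter": {"is-broader-than": ["monolith catalyst"]},
    "monolith catalyst":   {"is-narrower-than": ["catalytic converter"]},
}

def related(term, relation):
    """Follow one sense relation in the deployment network."""
    return deployment_network.get(term, {}).get(relation, [])

print(related("catalytic converter", "is-broader-than"))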
[Fig. 20: The University of Surrey's machine assisted term elicitation environment - MATE main menu with KonText, Term Refiner, Term Browser, Customiser, Query, IQuery, Quit and Help.]
Using the MATE (Machine Assisted Term Elicitation) environment, developed by Holmes-Higgin (see, for instance, Holmes-Higgin and Griffin 1991; Holmes-Higgin 1992; Ahmad et al. 1990; Ahmad et al. 1991), terminology can be elicited from a corpus of LSP texts, elaborated, disseminated and retrieved interactively or off-line in a variety of formats. The toolkit comprises:
KonText: generates word lists, indexes, KWIC lists, and concordances for use in acquiring terms and related terminological data from texts (a minimal KWIC sketch follows this list); sophisticated search facilities are provided;
Term Refiner: allows the user to design and create mono- or multilingual term banks.
Automated guidelines provide assistance when data is being entered, ensuring consistency and accuracy;
Term Browser: provides hypertext-like term bank browsing and navigation facilities
for rapid data retrieval;
Term Publisher: prints the contents of term banks in a variety of formats, including term lists, dictionaries, and full term bank records;
Customiser: allows each user to set up a personal installation and working profile.
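The minimal KWIC sketch announced in the KonText item is given below; it is an illustration only and not the actual (Prolog) implementation:

# KWIC (key word in context) listing over a small text sample.
def kwic(text, keyword, width=25):
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;:") == keyword.lower():
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

corpus = ("The dual substrate catalyst was an essential compromise for obtaining "
          "simultaneous conversion of HC, CO and NOx over a wide A/F ratio range.")
print("\n".join(kwic(corpus, "catalyst")))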
The MATE system is written in QUINTUS PROLOG, a logic programming language, available on a SUN SPARCstation running under the UNIX operating system. The user interface was written using ProWindows and the bulk terminology data was stored in ORACLE, a proprietary database management system. Recently, supported by Mercedes-Benz AG, a smaller version of MATE has been ported onto PC-compatible hardware: PC-MATE. PC-MATE is written in C++, a procedural programming language, on an IBM-compatible PC running under DOS (see Fig. 21). The user interface for PC-MATE was written using MS Windows 3 and the bulk terminology data was stored in COMFOBASE, a Siemens database system.
[Fig. 21: Implementation platforms - MATE: ProWindows and Prolog on UNIX (SUN workstation); PC-MATE: MS Windows 3 and C++ on DOS ((IBM) PC).]
a) Acquisitional data: typically axiomatic data, comprising the term, archival data
(such as the name of the terminologist or expert who identifies terms and the data on
which the identification was based) and references (or, more precisely pointers) to other
descriptive or relational data exemplifying the term and its usage;
[Figure: Data clusters of a term record - Acquisition Data, Explication Data and Deployment Data, with fields such as entry, parameter, text, source, country, terminologist, comment, language, date, type, related entry and usage.]
b) Representational data: generally descriptive data for categorising the term linguistically (e.g. part of speech, number, gender) and language use data (e.g. abbreviations,
variants, including chemical and mathematical formulae);
c) Explicatory data: descriptive data with the focus on meaning (e.g. definitions and
illustrations of contextual use);
d) Deployment data: data which signify the semantic or knowledge-based content of a
term and its relationship to other terms and their contexts, including foreign language
equivalents, synonyms, etc.
The record formats presented below were created for a multilingual (English, Spanish, German) term bank of automotive engineering and for a bilingual (Welsh, English) term bank of chemistry. In both cases, the fields of the record formats have been specified uniquely in order to meet the needs of translators. The data contained in the record formats were collated from a multilingual corpus of LSP texts (catalytic converter technology and radiation chemistry), using the MATE (Machine-Assisted Terminology Elicitation) environment. The record format (only a few fields of which are shown in Fig. 23) can be specified according to users' needs in other domains and language combinations. Examples of entry records:
DOMAIN: cat COD
TERMINOLOGIST: kbw
ENTRY DATE: 11-sep-90
LANGUAGE: en
COUNTRY: GB
ENTRY: A/F ratio
GRAMMAR:
GERMAN EQUIVALENT: Kraftstoff-Luft-Verhältnis
SYNONYM: air/fuel ratio
DEFINITION:
DEFINITION SOURCE: enGB00487
CONTEXT: The dual substrate catalyst was an essential compromise for obtaining simultaneous conversion of HC, CO and NOx over a wide A/F ratio range.
CONTEXT SOURCE: enGB00481
DOMAIN: radioactive chemicals
TERMINOLOGIST: AED
LANGUAGE: cy
COUNTRY: GB
ENTRY: arbrawf
ENGLISH EQUIVALENT: experiment
GRAMMAR: NOUN masc
WELSH DEFINITION:
ENGLISH DEFINITION:
DEFINITION SOURCE: enGB00003
CONTEXT: Sail yr arbrawf CMN yw cyflwyno yr ynni cywir ac, i ddilyn, mesur sut y caiff ei dderbyn gan y niwclysau.
CONTEXT SOURCE: cyGB00001
[Figure: German terms, English terms, Spanish terms]
Term Validation: The systematic verification and validation of terminology, i.e. the comments and criticisms made by domain experts, is logged in two ways: a) by regular consultation with domain experts - preferably native speakers who are domain experts; b) by the terminologist filling in a questionnaire ('log') of the validation process - the 'Term Validation Form' (see Appendix B) - in the presence of the expert(s) validating the term. The form comprises three major sections: i) archival notes (e.g. names of the terminologist and domain expert(s), the time and place of validation), ii) representation, explication and deployment data printed on the form directly from the term bank, and iii) comments on each of the data clusters by the experts. This form systematically records under the appropriate sub-headings all the details and suggestions given by experts: conceptual, linguistic and administrative data. Further controls are implemented as new terms are identified and elaborated.
Guidelines for the selection of contextual examples (see Fulford and Rogers 1990): Two
principal purposes can be identified for including contextual examples of use in a terminology. The first is to clarify the meaning of a term for purposes of both comprehension
and production by the end user. The second is to establish how the term is used stylistically and grammatically in running text (production only) (see also Sinclair 1988:xv).
Contextual examples of the term in use provide descriptive information for the user, in so
far as the terminologist is recording and following current language use in this aspect of
term elaboration (see also Sinclair 1988:xii for general purpose lexicography). Contextual
examples may also help the translator to distinguish between use of the term in different
text types. A definition is unable to fulfil this need.
The selection of contextual examples needs to proceed on a principled basis. Given the
current size of, for example, Surrey's automotive engineering corpus - c. 195,000 words
(British English), c. 161,000 words (American English), c. 126,000 words (German) and
c. 290,000 words (Spanish), a number of examples of contextual use of a term can be
found when an exhaustive search of the automotive engineering corpus is undertaken.
The University of Surrey has developed a set of guidelines designed to help the terminologist identify those contextual examples which are most appropriate to the needs of the end user. The guidelines are divided into a) 'Introductory' criteria for recommended texts, b) 'Comprehension' guidelines (decoding) based on text-linguistic and semantic considerations, and c) 'Production' guidelines (encoding) based mainly on grammatical criteria. The division of the guidelines can be shown diagrammatically as in Fig. 25.
[Fig. 25: Division of the guidelines - Contextual Examples: Comprehension (decoding), subdivided into text-linguistic and semantic criteria; Production (encoding), based on grammatical criteria.]
There are currently 16 guidelines in all: just under half (n=7) are based on text-linguistic measures (e.g. avoid examples containing pronouns; avoid examples taken from table headings or text titles); semantic criteria (n=4) (e.g. avoid advertising material or examples containing proper nouns; avoid examples with two or more technical terms); grammatical criteria (n=4) (e.g. if the entry term is in the singular [plural], find an example in which it appears in the singular [plural]); and text types to search (n=1) (see Ahmad, Fulford and Rogers, forthcoming, for more details).
Figure 26 depicts the distribution of texts across the respective languages and language varieties.
[Fig. 27: Structure and current size of corpus according to language and sub-domain (Surrey's automotive engineering corpus) - languages: British English, American English, German, Spanish; sub-domains include catalytic converters and miscellaneous.]
Text monitoring of this kind allows the user to gauge the balance of texts across languages
and across text types.
Six text types have been identified as representative of the range of texts encountered by
end users. Books and learned journals can be classified as informative, workshop manuals
are instructional, and newspaper articles and advertisements are broadly persuasive. One other informative text type is that of official documentation. By official documentation we understand patents, legislative articles, regulations and standards. The information such texts contain is highly descriptive and precise, and is often the first source of terminology of a new domain. These texts are also frequently found to contain definitions. To maintain control over texts and identify the differences between them, 'headers' have been inserted at the beginning of each text to indicate sub-domain, language, text type and date of publication. A full bibliography of texts is also maintained and is accessible to the user.
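The exact header syntax is not quoted in this chapter; the sketch below therefore assumes a simple key-value block recording sub-domain, language, text type and date of publication:

# Assumed "key: value" header block at the top of each corpus text.
SAMPLE = """\
sub-domain: catalytic converters
language: en-GB
text-type: informative
date: 1990-09-11
---
The dual substrate catalyst was an essential compromise ...
"""

def read_header(text):
    header, body = text.split("---\n", 1)
    fields = dict(line.split(": ", 1) for line in header.splitlines() if line)
    return fields, body

fields, body = read_header(SAMPLE)
print(fields["text-type"], "|", body.strip()[:40])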
A corpus-based approach enables the terminologist to identify, study and record terms in
their natural habitat, and translators, as users of the term bank, to have customised access
to a full set of information about all the terms and related terminological data of a domain.
10.8 Conclusions
The main results achieved by the TWB project at the University of Surrey may be summarised as follows:
A four-phased methodology has been established for creating and maintaining corpora of LSP texts, for eliciting terms from such text corpora, and for validating them and their associated data with the help of domain experts. This terminology management methodology provides the basis for terminographical work using computer-based resources. Its artifacts comprise a text corpus and a term bank for use in translation. Our work is therefore of relevance to the terminologist/terminographer and the translator.
A prototype termbank of automotive engineering has been created using the four-phased
methodology. The methodology has been refined to optimise progress and to counter language-specific problems which occurred during the creation of the prototype termbank.
The four phases of term elicitation have also benefited from the monitoring procedures
established during the project for quantifying the work of terminologists and from providing them with a Machine-Assisted Terminology Environment (MATE) for managing and
searching corpora and terminological data banks.
10.9 Appendix
TERM VALIDATION FORM
TWB
Sub-Domain:
Catalytic Converters
Terminologist
Date:
Place:
Verifier Name
Verifier Address
Term to be verified; its definition, foreign language equivalent and an example of contextual usage
Klopfen
[00214
nnt; =Klingeln
Premature ignition of the air/fuel mixture in the combustion chamber of an internal combustion engine, causing damage to the engine.
Mithin konnten effizientere Motoren mit höherer Kompression gebaut werden, ohne Gefahr zu laufen, daß das Luft/Kraftstoffgemisch unter hohem Druck zu früh explosionsartig verbrennt; dadurch entsteht das berüchtigte Klopfen oder Klingeln des Motors.
Verifier's Comments on: (Add x or as appropriate)
(a) CONCEPTUAL Data in the Record:
Definition Current
Extra (indicate attached sheets)
Context Current
Extra (indicate attached sheets)
(b) LINGUISTIC Data in the Record:
(c) ADMINISTRATIVE Data of the Record:
METAL is a system which follows the approach of "computer integrated translation"; i.e.
it tries to integrate tools needed in translation and to group them around a machine translation kernel. It follows the same approach as TWB: allow for integration of the software
tools in the process of documentation and translation.
The first prototype was developed in Austin, Texas; product development has been based
in Munich since 1986; the system has development sites in Austin, Spain, Belgium, Denmark, and Germany. Languages treated are English, French, German, Spanish, Dutch,
Danish, and Russian. Others, like Chinese, are being investigated.
The system is described in more detail in Thurmair 1991.
As the translation of a term is often dependent on its context, the lexicon lookup must do a full parse of the input to determine its translation; this also holds for multiword entries. Users can control this by parameters.
Missing terms must be coded in the lexicon. For this purpose, a comfortable coding tool (called the Intercoder) has been created.
After the lexicon has been updated, the text can be translated. METAL uses an augmented phrase structure grammar for morphological and syntactic analysis. The text is parsed with an active chart parser using a middle-out control strategy and a scoring mechanism (Caeyers/Adriaens 1990).
The parser produces a canonical interface structure called the "METAL Interface Representation" (MIR). It describes the linguistic properties of the input sentence in terms of tree structures and features, but it has words on its leaves. Therefore it is neither a pure transfer approach (as the original METAL prototype was) nor an interlingua, but a combination of both, which can be called a "transfer interstructure" (Alonso 1990).
On this MIR representation, transfer is performed, which again produces MIR structures, for both simple and complex lexical transfer (Thurmair 1990). Finally, generation rules produce the proper target text from the MIR input.
This approach allows new language pairs to be combined easily from existing components at relatively low cost.
After translation, the text must be postedited. Users are supported here with several
postediting and search-and-replace tools, as well as a special editor. Postediting time is
critical for the machine translation process.
After postediting, the text is reformatted (i.e. brought into MDIF format from PTF format), and then it is reconverted (i.e. brought into the user's native text processing format from the MDIF format).
This process can be split into two parts, following a client-server philosophy: Conversion,
text processing, and postediting can be done on a client; translation, which needs massive
computing power, can run on a server in the network. Of course, both client and server can
run on the same machine as well (a standard UNIX platform).
11.2.4 Editor
METAL uses a special editor (MED: Metal editor) for internal purposes. It is based on an
extension of EMACS and is designed particularly for translation and postediting purposes:
It is based on the ISO 8859/1 character set. Care was taken to support easy handling of foreign characters.
It is completely transparent with respect to control sequences: as the escape sequences of a foreign system could enter METAL (as part of the MDIF files), the editor must not react to them in a strange way. As a result, even binary files can be edited with MED.
It uses function keys for the most frequent postediting operations (collecting words, moving them around in sentences, etc.), as well as special editing units like "translation unit".
It is not designed, however, for comfortable layouting, text element processing, etc., as it is assumed that this has already been done outside METAL (in the documentation department). It offers everything needed to (post)edit a text coming from outside, without damaging it by adding additional or new editing control sequences.
11.3.1 Lexica
The METAL lexica are organised along two major divisions. First, they consist of monolingual and bilingual dictionaries. Monolingual dictionaries contain all the information needed for (monolingual) processing. METAL uses the same monolingual dictionaries for analysis and generation, and the monolingual lexica can be used for purposes other than machine translation (like, in the case of TWB, verification of controlled grammars). Bilingual lexica contain the transfers. The METAL transfer lexica are bilingual and directed (i.e. the German-English transfer lexicon differs from the English-German transfer lexicon); this is quite natural in the case of 1:many transfers (see Knops/Thurmair 1992).
The second division of the lexica follows the subject areas. The lexica are divided into different modules to specify where a term belongs. The modules are organised in a hierarchy, starting from function words and general vocabulary, then specifying common social
and common technical vocabulary, and then specifying different areas, like economics,
law, public administration, computer science, etc. These subject areas can still be subdivided further, according to users' needs. This modular organisation does not just allow for
interchange of lexicon modules; it also allows for better translations. Users can specify the
modules to be used for a given translation, and the system picks the most specific transfers
first.
Internally, the monolingual lexica are collections of features and values. Features describe the phonetic, morphological, syntactic, and semantic properties of an entry. The transfer lexica describe the conditions and actions for a given transfer entry to be applied. The number of entries in the lexica varies for the different languages; it lies between 20,000 and 100,000 citation forms.
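The two lexicon types can be sketched as follows; all feature names and the selection logic are invented for illustration and do not reproduce the METAL lexicon format:

# A monolingual entry as a collection of features and values, and a directed
# 1:many transfer with a condition and an action (all names invented).
mono_de = {"canonical": "Bank", "cat": "noun", "gender": "fem",
           "semantic-type": ["institution", "furniture"]}

transfer_de_en = [
    {"source": "Bank", "condition": {"semantic-type": "institution"},
     "action": {"target": "bank"}},
    {"source": "Bank", "condition": {"semantic-type": "furniture"},
     "action": {"target": "bench"}},
]

def apply_transfer(word, context_type):
    for t in transfer_de_en:
        if t["source"] == word and t["condition"]["semantic-type"] == context_type:
            return t["action"]["target"]
    return None

print(apply_transfer("Bank", "furniture"))   # -> bench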
It must be kept in mind that an MT lexicon entry differs considerably from a terminological entry, which is basically designed for human readers (see Knops/Thurmair 1992 for a comparison). The challenge is to find common data structures and lexicon maintenance software to support both applications.
11.3.2 Analysis
METAL is a rule-based system. It applies rules for morphological and syntactic analysis.
The METAL rules have a phrase structure backbone which is augmented by features.
Rules consist of several parts. A test section specifies under what conditions a rule can fire; conditions include tests on the presence or absence of features, structures or contexts. A construction section applies actions to a given structure; these consist of feature percolations, adding new features to a node, changing tree structures, and producing the canonical MIR structure for a given subnode. There are other rule parts as well, including the rule type (morphological or syntactic), maintenance information (author, date, last editor) and comment and example fields.
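The anatomy of such a rule can be illustrated as follows (this shows the structure only, not actual METAL notation):

# A rule with a phrase-structure backbone, a test section, a construction
# section, and maintenance fields (structure only, not METAL notation).
rule = {
    "type": "syntactic",
    "lhs": "NP", "rhs": ["DET", "N"],                  # phrase structure backbone
    "test": lambda det, n: det["agr"] == n["agr"],     # conditions for firing
    "construct": lambda det, n: {"cat": "NP",          # feature percolation
                                 "agr": n["agr"],
                                 "head": n["lemma"],
                                 "mir": ("NP", det["lemma"], n["lemma"])},
    "author": "gt", "date": "1991-05-01",
    "comment": "simple determiner-noun NP", "example": "the converter",
}

det = {"cat": "DET", "lemma": "the", "agr": "sg"}
noun = {"cat": "N", "lemma": "converter", "agr": "sg"}
if rule["test"](det, noun):
    print(rule["construct"](det, noun))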
The grammar itself uses a set of operators which perform the respective actions; it is a
kind of language in itself. The operators are described in Thurmair (1991). METAL grammars comprise between 200 and 500 rules depending on their coverage.
The rules are applied by a standard active chart parser for the different grammars. If the
grammar succeeds it delivers a well-formed MIR structure; if not, it tries a "fail soft" analysis and combines the most meaningful structures into an artificial top node.
There are three special issues to be mentioned:
Verb argument treatment is always a critical issue, as a verb can often have several frames with optional elements. It is difficult and time consuming to calculate all possible combinations between different frames and potential fillers. METAL uses special software to do the calculation. The frames for a verb are specified in the monolingual lexicon in a rather general way (which allows this information to be interchanged with other lexica, cf. Adriaens 1992). Analysis uses sets of morphosyntactic and semantic tests to identify potential fillers for a given verb position (e.g. the verb takes an indirect object filled with a "that"-clause). Usually there are several candidates for a role filler; the computation of the most plausible one is done by software. Again, all language-specific aspects are part of the lingware and maintained by the grammar writer; the software only does the calculation and is language independent.
Another area where METAL uses software support is anaphor resolution. Anaphors are
identified following an extension of the algorithm of Hobbs, taking into account the different c-commanding relations of the different pronoun types. The anaphor resolution is
called whenever a sentence could be parsed. It is also able to do extrasentential resolution.
The anaphor nodes are marked with some relevant features of their antecedents.
With rather large grammars as in METAL, the danger exists that the system produces too
many hypotheses and ambiguities which cause the system to explode. METAL avoids this
danger by applying a preferencing and scoring mechanism which processes only the best
hypotheses at a time. The score of a tree is calculated from the scores of its son nodes and
the level of the rule which was fired to build it. Scoring is controlled by linguistics, by
attaching levels to rules (indicating how successful a rule is in contributing to an overall
parse), and by influencing the scores of trees explicitly in rules (see Caeyers 1990). During
parsing, the scores are evaluated, and only the best hypotheses are processed further.
Robustness is always an issue for a system like METAL. It is tackled at several stages in
the system:
Unknown words are subject to special procedures which try to guess the linguistic
properties of that word (like category, inflectional class, etc.). This is done using an
online defaulting procedure (see Adriaens 1990).
Ungrammatical constructions are partly covered by applying fallback rules for certain phenomena (like punctuation errors in relative clause constructions); the respective rules have lower levels than the "good" ones.
If the parser fails, the system still tries to find the best partial interpretations of the input
clause; this is under control of the linguists: they can apply default interpretations, basic
word ordering criteria, etc.
The result of the analysis is a MIR tree which is defined in terms of precedence and dominance relations, and in terms of (obligatory and optional) features.
11.3.3 Transfer and Generation
This tree is transferred into the target language. Transfer comprises three steps: structural transfer transforms source language constructions into target language constructions; lexical transfer replaces nodes on the leaves of the tree by their target equivalents; and complex lexical transfer changes both structures and lexical units (e.g. in cases of verb argument mappings or argument incorporation; see examples in Thurmair 1990). Transfer again creates MIR trees.
These trees are the input to the generation component. Generation transforms them into proper surface trees, using the same formalism as the analysis component, in particular the tree-to-tree transformation capabilities. As a result, properly inflected forms can be found at the terminal leaves of the trees; they are collected and transformed into the output string.
11.3.4 Productivity Tools
In order to support the translation kernel, METAL has developed sets of tools for lexicon
and grammar development and maintenance.
The basic coding tool is called Intercoder. It allows for fast and user-friendly coding of
new entries. It applies defaulting strategies to pre-set most parts of lexical entries; it
presents the entries via examples rather than abstract coding features. As a result, users
need only click on the items to select or deselect them. Internally, the intercoder consists
of a language-independent software kernel; the language-specific coding window systems
are controlled by tables which are interpreted by the software kernel.
In addition to the Intercoder, METAL offers several tools for lexicon maintenance.
Among them are consistency checking routines (does every mono have its transfer and
target entry?), import/export facilities, merging routines which resolve conflicts between
lexical entries, lexicon querying facilities, and others.
For grammar development, a system called METALSHOP has been developed. It allows
for editing, deleting, changing rules, for inspection of the chart during analysis, for rule
tracing and stepping, for tree drawing and comparison, for node inspection, and others. It
also supports suites of benchmark texts and automatic comparison of the results.
These productivity tools are indispensable in the industrial development of large-scale natural language systems. Otherwise there will never be a return on investment for this kind of system.
losophy, because it was available, and because of its embedding into the document
production environment.
In order to be open and to follow standards, the communication between the TWB system and METAL is done through standard X.400 electronic mail. For this reason, an X.400 system (based on the results of the CAcruS ESPRIT project) has been integrated in both the TWB and METAL systems. A special X.400 user agent that interacts with the MT system has been developed and interfaced with METAL (through "X-Metal"). The role of this user agent, installed on the MT system side, is to receive X.400 messages from outside and to generate replies back with the translated document.
The use of ODA/ODIF to send documents to MT systems guarantees the standardization of the input and output formats, and it allows a translation to be produced with the same structure and format as the source text (see Chap. 7 for more detailed information).
As a result, it turned out that remote access has its obstacles and requires additional efforts
12.1 Introduction
The development of fully automatic high-quality machine translation has a long and chequered history. In spite of the extensive amount of protracted research effort on this topic, the ultimate goal still lies very far beyond the horizon, and no substantial breakthroughs are in the offing.
Today's machine translation (MT) systems perform reasonably well in very restricted subject areas and translate texts with restricted grammatical coverage. However, most existing MT systems cannot live up to the needs and expectations, both in quality and cost, of an ever growing market for translations. In Europe alone, several million pages per year are being translated. The future European integration, through the EC, will certainly lead to an impressive increase in this figure, making translation more and more a serious cost factor in product development and sales.
The translation of technical documentation, manuals, and user instructions comprises the
bulk of work in most translation departments and bureaus. One of the most striking characteristics of such texts is that there is a degree of similarity amongst this type of text.
Moreover, as we found out in a specially commissioned survey by TWB (Fulford et al.
1990), many of these texts are translated more than once, because new versions of these
texts become necessary as the documented product alters. The original version of the text is often difficult to locate, and even when traced, pin-pointing the differences between the two versions and appropriately editing the translation is a laborious and time-consuming task. There are no tools that effectively support translators in this task, and very often a completely new translation is considered to be a reasonable alternative.
The frequent translation of documents which are similar in content clearly indicates the
need for a translation aid system which makes previously translated texts or parts of texts
directly available, without the user having to expend much effort. We claim the Translation Memory to be such a system.
Translation Memory is more than a system that just stores and retrieves texts. It collects
and applies statistical data from translated text, builds stochastic models for source and
target language (SL and TL), and for the transfer (the translation) between these SL and
TL models. The Translation Memory system displays a 'cumulative learning behaviour':
once the stochastic models have been developed on a small sample of texts, the system's
performance improves as it is exposed to more text. This exposure helps the stochastic
models to automatically expand their coverage. Due to this approach the system can not
only retrieve/re-translate old sentences, but can even translate sentences that never have
been input before, provided that their components (words or phrases) have already
occurred in previously stored texts. As a result, system performance is dependent on the
scope and quality of the existing database and is expected to improve as the database
grows.
In the following we will discuss the state of the art in statistical machine translation and
will then present our approach. Next we will describe the implemented system followed
by first results of the system in use.
We will conclude this chapter by briefly discussing future work on the Translation Memory.
3. arranging the words of the target fixed locutions into a sequence that forms the target sentence.
"Fixed locutions" may be single words as well as phrases consisting of contiguous or non-contiguous words. Although the papers present many fruitful ideas with regard to stage 1
of the process, they do not (yet) describe to the same extent their ideas and solutions for
the further stages.
An important aspect of Brown et al.'s approach is that all the statistical information is extracted fully automatically from a large bi-lingual text corpus. Brown et al. have argued that it is possible to find the fixed locutions by extracting a model that generates TL words from SL words. Such a model uses probabilities that describe a primary generation process (i.e. production of a TL word by a SL word), a secondary generation process (production of a TL word by another TL word), and some restrictions on positional discrepancy of
the words within a sentence. This gives rise to a very large number of different probabilities, and the automatic extraction of these probabilities seems to be computationally very
expensive and requires highly advanced parameter estimation methods and a very large
amount of corresponding translated training text. For the construction of the contextual
glossary for stage 2, one would need all these probabilities plus, of course, some new
ones, all of which result in the production of glossary probabilities.
Despite the large number of unsolved problems, first experiments in French-to-English translation by the IBM group have shown promising results. With a 1,000-word English lexicon and a 1,700-word French lexicon (the most frequently used words in the corpus) they estimated the 17 million parameters of the translation model from 117,000 pairs of
sentences that were fully covered by the lexica. The parameters of the English bigram language model were estimated from 570,000 sentences from the English part of the corpus.
In translating 73 new French sentences from the corpus they claimed to be successful 48%
of the time. In those cases where the translation was not successful, it proved to be quite
easy to correct the faulty translation by human post-editing, thus all in all reducing the
work of human translators by about 60%.
the digram probabilities P(sj | si), that the word sj follows immediately after the word si, and
the trigram probabilities of the form P(sk | si, sj), that the word sk follows immediately after the sequence si, sj.
These probabilities can be estimated by counting the respective relative frequencies from
the trained input sentences. Note that a model which is defined by trigram-probabilities is
in fact a Markov model of order three. The digram-probabilities and the single probabilities define Markov models of order two and one respectively.
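To make the estimation step concrete, the following sketch counts relative frequencies for single words, digrams and trigrams over a toy corpus. The function and variable names are our own; the actual TWB implementation is not described at this level of detail.

from collections import Counter

def estimate_ngram_probs(sentences):
    """Estimate single, digram and trigram probabilities from relative
    frequencies in a list of tokenized sentences (illustration only)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}
    p_bi = {(a, b): c / uni[a] for (a, b), c in bi.items()}                 # P(b | a)
    p_tri = {(a, b, c3): n / bi[(a, b)] for (a, b, c3), n in tri.items()}   # P(c | a b)
    return p_uni, p_bi, p_tri

corpus = [["die", "Mutter", "liebt", "den", "Vater"],
          ["der", "Vater", "liebt", "die", "Mutter"]]
p1, p2, p3 = estimate_ngram_probs(corpus)
print(p2[("liebt", "die")])   # relative frequency of "die" directly after "liebt"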
Thus the language models of the Translation Memory are in fact models that are an integration of Markov models of different order. The connections between the models of different order are equivalent to the application of two rules:
(1) After a transition in a Markov model of order m, the process changes state to the Markov model of order m+1 without producing additional output, provided that there is a Markov model of order m+1. This state in the Markov model of order m+1 is uniquely defined by the transition in the Markov model of order m.
(2) If a proper path in a Markov model of order m (that corresponds to a given sentence, for example) cannot be found for reasons of lack of training data, the process changes state to the Markov model of order m-1. The state in the Markov model of order m-1 is defined by cutting off (deleting) the first word of the string that defines the state in the Markov model of order m.
Figure 30 shows the principle of a model that integrates a Markov model of order three with a corresponding model of order two and one of order one (the figure only shows the transitions, not the probabilities attached to them). The order-one model indicates that any transition from some word in the lexicon to any other word in that lexicon is possible in the order-one model. The special symbols <begin> and <end> are markers for the beginning and end of a string. The dashed arrows indicate some of the possible state transitions that do not produce output but are used to change the order of the model.
If we had to find the path in this integrated model that corresponds to the string "<begin> die Mutter liebt der Vater <end>", we would find that the word "der" cannot be produced by a transition to any successor state of the order-3 state defined by "Mutter liebt". Thus we have to decrease the order and change to the state in the Markov model of order two that is defined by the word "liebt". We find once more that "der" cannot be generated by any transition from the order-2 state "liebt". Therefore we change state to the only state of the order-one Markov model. Now, there is a transition in the order-one model that produces "der", because this word is contained in the lexicon. After this production the process automatically changes state to the order-2 state defined by the word "der". This state has a transition to produce "Vater" and, moreover, leads to a non-productive state transition to the order-3 state "der Vater". Since there is no corresponding transition from this state to produce "<end>", we have to reduce the order again and go back to the order-2 state "Vater", find from there an order-2 transition to "<end>", and end with an order-change transition to the order-3 state "Vater <end>".
that the SL word si, which meets the contextual conditions defined by W[si, S], is translated by the TL word tj, provided that tj complies with the TL contextual conditions defined by W[tj, T], where S denotes the SL sentence we are translating and T the corresponding TL sentence we are generating as a translation for S. This first form is valid for the most frequent case that one TL word corresponds to one SL word. The second form of probabilities, treating multi-word correspondences, consists of entries
P(tj, W[tj, T], ..., tk, W[tk, T] | s1, W[s1, S], ..., sn, W[sn, S])
that the SL words s1, ..., sn are translated by the TL words tj, ..., tk, provided that the respective contextual conditions are satisfied. Since SL and TL units do not have to contain the same number of words, we are thus in a position to appropriately align SL and TL sentences of unequal length.
The contextual conditions of a word si or tj are defined by zero, one or two predecessor words. Thus the glossary establishes connections between the single, bigram and trigram probabilities of the connected language models.
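One way to picture such a glossary is as a table keyed by an SL unit together with its observed context, with TL units and their contexts as values. The layout, field names and probability values below are illustrative assumptions rather than the actual TWB data structures.

# Hypothetical in-memory layout for a contextual transfer glossary.
# Keys: (SL unit, SL context of up to two predecessor words);
# values: list of (TL unit, TL context, probability).

transfer_glossary = {
    (("Mutter",), ("die",)): [(("mother",), ("the",), 0.9)],
    (("liebt",), ()):        [(("loves",), (), 0.8),
                              (("is", "fond", "of"), (), 0.2)],
}

def candidates(sl_unit, sl_context):
    """Return TL candidates for an SL unit, backing off to the empty context
    if the specific context has not been observed."""
    for ctx in (sl_context, sl_context[-1:], ()):
        entry = transfer_glossary.get((sl_unit, ctx))
        if entry:
            return entry
    return []

print(candidates(("Mutter",), ("die",)))    # contextual match
print(candidates(("liebt",), ("Mutter",)))  # falls back to the empty context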
Figure 31 illustrates the transfer in a rather simplified model. Having an order-2 model for both SL (English) and TL (German), some of the transfer probabilities are indicated by the dashed arrows (or rather, the dashed arrows indicate that such probabilities are contained in the transfer glossary, since we did not attach quantities to those arrows). Generating a TL sentence from an SL sentence thus amounts to solving a complex constraint satisfaction problem. This will be relatively easy for sentence pairs that the Translation Memory has been trained with in the past, since the "correct" applicable probabilities are already contained in the complex translation model. But the same procedure can be used to translate completely new sentences (i.e. strings the Translation Memory has not been trained with). Given a well-trained model, these translations will turn out to be quite acceptable, although a human translator might need to post-edit them.
The implemented system comprises four programs:
(1) sth: the standard program by which a user can interactively make translations or train/update the system's databases;
(2) sthback: a program for batch mode translation of texts;
(3) sthtra: similar to sth, but comprises interactive translation only;
(4) sthlearn: similar to sth, but comprises training/data acquisition only.
Since the databases containing all the statistical information are of vital importance for the system, extreme care was taken to secure them against data loss owing to unexpected system terminations.
In the following, we will briefly discuss the sth program, since it shows the basic functionality (translation and training) of the system.
When starting the program, the user is presented with a window in which he or she has to specify some basic information: the databases to use for training/translation, and the SL and TL text files he or she wants to operate on. If needed, the user can get access to a further parameter settings dialogue, in which he or she can set parameters concerning the editor (initial cursor position, window width, scrollbars, highlighting of non-input-focus areas, and automatic selection/highlighting of the next sentence after a call to the translation or data acquisition routines) and parameters concerning the behaviour of the translation routine (whether to prefix unknown words with "*", whether the program is allowed to shift the order of the SL or TL model, the maximum number of search steps allowed, and whether or not it should ignore context in the transfer phase, thus providing the user with the possibility to generate a poor but extremely fast word-by-word translation) and thus the quality of the translation. All these settings can be saved and used as defaults for future program calls.
After this confirmatory dialogue, the user interacts with an editor (see Fig. 32) that is split into two areas, each of which contains the chosen SL or TL text file respectively. The editor is WIMPS based and has a number of buttons. With the button "next sentence" the user can select/highlight the next sentence from the current cursor position in that area. When a sentence is selected and the "translate" button is pushed, the system generates a translation for this sentence that will be inserted in the text of the other area at its cursor position. The other buttons have standard editorial functionality. The "Parameter" button leads to the additional parameter settings dialogue discussed above. "Cancel" terminates the program. The "Learn" button can only be used when a string (not necessarily a sentence) is selected in both text areas. This button is located outside the above areas because the data acquisition routine performs parameter estimation in both directions, thus building and refining - apart from the integrated models for both languages - a transfer glossary for both directions. Thus, the Translation Memory can perform translations in both directions on the basis of the very same databases.
When the "Learn" button is pushed, the data acquisition routine first refines the data models for the SL and TL on the basis of the changed relative frequencies. If the data acquisition routine encounters unknown words it asks the user whether it should store the words
in the database.
After the refinement of the language data models, the routine has to estimate the transfer parameters. To be able to do this, the system has to align the respective selected strings. To the extent that it has information on previous alignments of words occurring in the strings, the data acquisition routine can do this automatically. Where this information fails, the routine interacts with the user in a so-called alignment dialogue (Fig. 33), in which the user can indicate both single-word and multi-word correspondences. Before updating the databases, the routine presents the user with a confirmation dialogue (Fig. 34) that shows the unit correspondences in both translation directions. The user can confirm these, in which case the databases will be updated accordingly, or the user can indicate that the correspondences are wrong, in which case he or she will again be presented with the alignment dialogue. During the execution of the program the database updates have a temporary status. Only when the user wants to exit the program are these updates made permanent. Thus, the user has the possibility to "undo" the data acquisition of a given session. Furthermore, in case of an unexpected system termination the databases will not be corrupted, something which is of vital importance for a stochastic system.
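This "temporary until normal exit" behaviour corresponds to a simple journalling pattern, sketched below with hypothetical class and method names; the actual implementation is not documented in this detail.

class SessionStore:
    """Keep acquisition updates in a session journal; merge them into the
    permanent database only on a normal exit, so that an unexpected
    termination leaves the permanent data untouched."""

    def __init__(self, permanent):
        self.permanent = permanent      # e.g. dict of n-gram counts
        self.journal = {}               # updates of the current session

    def record(self, key, delta):
        self.journal[key] = self.journal.get(key, 0) + delta

    def undo_session(self):
        self.journal.clear()            # "undo" the data acquisition

    def commit_on_exit(self):
        for key, delta in self.journal.items():
            self.permanent[key] = self.permanent.get(key, 0) + delta
        self.journal.clear()

store = SessionStore({("die", "Mutter"): 3})
store.record(("die", "Mutter"), 1)          # training increments a count
store.commit_on_exit()                      # made permanent only here
print(store.permanent[("die", "Mutter")])   # 4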
source and target language words/word groups should be done in a way that would yield only acceptable translations when the word/word group occurs in another environment.
One problem was that the Translation Memory in its present implementation can only handle so many secondaries to a "kernel" word, i.e. the length of a phrase to be used as an undividable unit is restricted. In some cases a sentence had to be changed slightly in order
to obtain one that would not force the user to encode it in a meaningless way. The new
sentences probably were not quite as polite as the original ones but still represented the
same meaning.
In training the Spanish-English pair, it was often very easy to establish one-to-one correspondences. This situation was pleasing but a note of caution is due here. It leads one to
encode the sentence or phrase with one-to-one correspondences right away before considering which words it would be sensible to encode in groups first. This leads to some severe
errors later when actual translations are made. Some adjectives just occur often with certain words and some nouns often combine to make a composite noun. If these adjectives
and nouns are not at first encoded as fixed combinations, they turn up in the wrong order in
the translations or, even worse, a different word is chosen which usually should not occur
in this particular combination. Thus it is better to first encode phrases as phrases and later
break them down into smaller parts if necessary. This will ensure better quality of translation.
12.5.3 The Optimal Translation Parameters
In order to determine the translation behaviour of the system, and the degree to which it is
influenced by its parameters, two fairly extreme sets of parameters were tested. The idea
was to test the translation quality of the database on the same text(s) and look for changes,
newly introduced errors etc. with the different parameter sets. When doing test translations
in between the training of two texts, the following two sets of parameters were applied:
Simple:
simple method,
no check of neighbours,
max length of list: 20,
no context reduction,
max number of search steps: 2000.
Complex:
standard method,
check one neighbour,
max length of list: 30,
reduce context once,
max number of search steps: 5000.
The "simple" parameters represent a fairly limited search space, i.e. a short list, and comparatively small number of search steps. The complex parameters have a larger search
space, and additionally more context information to check.
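For clarity, the two parameter sets can be written down as configuration records; the field names below are our own paraphrases of the options listed above.

from dataclasses import dataclass

@dataclass
class TranslationParams:
    method: str               # "simple" or "standard"
    neighbours_to_check: int  # how many neighbouring words to check
    max_list_length: int      # maximum length of the candidate list
    context_reductions: int   # how often the context may be reduced
    max_search_steps: int     # upper bound on search steps

SIMPLE = TranslationParams("simple", 0, 20, 0, 2000)
COMPLEX = TranslationParams("standard", 1, 30, 1, 5000)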
English-German
For the pair English-German only one test text was used. It contained sentences which were almost entirely taken from the phrases of the annual report of an enterprise, but were a little modified and contained some unknown words.
Looking at the simple parameter translation, we see that the quantity of translation improved: that is to say, more of the unknown words were translated with the growing size of the database. On the other hand, there was a decrease in the quality of the translation.
One problem is the use of the determiner. In English there is only "the" and "a", but in German there are "der, die, das, dieser, diese, dieses, ein, eine" and so on. Therefore, with the growing size of the database the use of determiners became incorrect more often.
Comparing the translations under simple parameters with those under complex parameters, word order seems to come out with a higher degree of correctness under the simple parameters.
Spanish-English and Spanish-German
For the pairs Spanish-German and Spanish-English, two test texts were employed. The first one ("sptest1.txt") was like the test text for English-German. The second test text ("sptest2.txt") was a letter containing words, phrases and sentences from all eight texts listed above. It contained at least one sentence or phrase from each text.
By and large, there were no big differences between the translations offered under the two different sets of parameters. But each set has its own virtues and drawbacks. Under the simple parameters the syntax followed that of the source language, whereas the complex parameters often turned out a sentence structure that was more appropriate in the target language. The complex parameters, for example, were able to handle the distance between a modal verb and the main verb that we often find in German sentences. On the other hand, sometimes the complex parameters could not find a rather simple word like "hemos = we have" which had occurred often enough in training, although maybe not in this particular combination.
So while the simple parameters sometimes turn out sentences that are a little jumbled, the complex parameters tend to be "overcautious" and rather turn out nothing at all for seemingly simple words.
As the training proceeded, certain changes appeared. A rather complex phrase "Las actas fueron levantadas por Sr. Meyer" (German: "Für die Protokollführung war Herr Meyer verantwortlich.") which had been trained with the first text was forgotten as soon as the second text was trained. It probably was too complex to be recognized even under the complex parameters. Prepositions and articles tended to change as more texts were trained, which, of course, is due to the quantitative approach that is employed in the TM. So "an order" became "a order", "at the moment" became "by the moment", and so forth. This happened in both language pairs respectively. These "mistakes" will probably have to be detected by a grammar checker.
Functional Improvement
The lack of morphological and grammatical information is a serious limitation of the system in its present state. The relevant knowledge about the grammatical structure of sentences is already available (see Chap. 16). As the cohesion of words within phrases, e.g. within a noun phrase, is much stronger and thus statistically much more significant than the cohesion at phrase boundaries, a restriction of training to phrases only could lead to a considerable reduction of database size with very little loss of performance. The integration of morphological information can provide a fall-back position for the translation
where the inflected word is not known. In the follow-up project we will try to follow one
or more of these strategies of optimization.
First, since most potential users of the system (both translation bureaus and freelance translators) work on PC platforms, the Translation Memory will be ported to DOS and be integrated into a standard word processing environment. For this we envisage an integration into Word for Windows. We will also use this opportunity of porting the system to MS Windows for redesigning the user interface of the system, especially to ease the process of data acquisition.
Second, since well-trained Translation Memory databases are expected to accumulate up to gigabytes of probabilistic data, we think that it is important to look for ways of compressing this information considerably, notwithstanding the fact that the cost of external computer memory is falling steadily.
Third, although the synthesis of target language sentences is quite reasonable in many cases, the percentage of reasonable productions decreases as SL and TL differ more in the way they organize their word order. The linguistic issue of so-called long-distance dependencies plays a role here as well. To improve the quality of translation we want to look into possibilities to integrate information on these issues into the models. As a further help to the user, the system might provide him with translation alternatives from which he can then choose the (most) adequate one.
All these issues were brought into the proposal for a continuation of the TWB project in
the new phase of ESPRIT.
The sizes of the database files (.com, .dbm, .nod and .wor) were recorded for the language pairs en_ge, sp_en, en_sp, sp_ge and ge_sp at three successive stages of training. The .com file grew from 68,096 bytes through 172,544 bytes to 311,808 bytes for every pair; over the same stages the .dbm files grew from roughly 27-33 Kbytes to 68-92 Kbytes, the .nod files from roughly 9-10 Kbytes to 55-61 Kbytes, and the .wor files from roughly 1.5-2.2 Kbytes to 7-10 Kbytes.
debemos("contar")("con")
("a")("slight")decrease
("una")("ligera")disminucion
("in")production
("en")("la")produccion
=
Then some phrases and words were trained on their own:
a)("we")must
debemos
b) with
con
c) slight
ligera
decrease
disminucion
E: Please let us know the maximum quantity you can supply immediately.
G: Bitte teilen Sie uns die größte Menge mit, die Sie sofort liefern können.
First round:
Please
Bitte
let("us")("know")
teilen("Sie")("uns")("mit")
("the")("maximum")quantity
("die")("grB6te")Menge
("you")("can")supply("")immediately =
=
Then:
a) maximum
größte
b) quantity
Menge
c) supply
liefern
d) immediately
sofort
e) können
can
Vds.
Sie
conocen
kennen
("la")empresa
("die")Firma
("desde")("hace")tiempo
("seit")("einiger")Zeit
=
Then:
a) empresa
Firma
b) tiempo
Zeit
13.
implemented during the second phase of the project, includes additional information categories considered to be of particular relevance in the translation context: transfer and synonym comment, encyclopaedia, hierarchy, and word family.
As the main burden of developing and implementing the extended termbank version lies in the integration of transfer comments and encyclopaedic information, these categories will be discussed at greater length.
dt. Regelung (Regelkreis)
dt. Regelung (Regelkreis)
dt. Steuerung (Steuerkette)
Wird im englischen Text eine der Mehrwortbenennungen durch die Kurzform control ersetzt, so kann es bei der Übersetzung ins Deutsche zu einem Transferproblem kommen. Aus dem (Text-)Zusammenhang muß erschlossen werden, ob eine Regelung oder eine Steuerung gemeint ist
(-> EU: REGELUNG UND STEUERUNG DER GEMISCHBILDUNG BEI OTTOMOTOREN). Im Falle von control system ist entsprechend zu klären, ob es
Fig. 35: Transfer comment example ("control system")
What is crucial here is the distinction between closed loop (or feedback) control (dt. Regelung) and open loop control (dt. Steuerung). In a German text it is either Regelung or Steuerung; therefore, the translation into English should create no particular problem. This is quite different, however, when translating from English into German. In English texts, the short form control is used quite often without any precise indication of the type of control (open or closed loop). The direction-specific transfer comment provides information
to help the translator, first to become aware of the problem, second to determine the intended reference, and third to select an appropriate equivalent.
Stylistic discrepancies between the two languages may also call for a transfer comment.
The English term stoichiometry, for instance, does not belong to the same part of speech
as its German transfer equivalents. An appropriate translation may therefore require an
extensive restructuring of the original phrase.
TRANSFER COMMENT related to the English terms stoichiometry, lambda (simplified
version) (cf. Fig. 36):
Transferäquivalente:
engl. stoichiometry
dt. stöchiometrischer Punkt
dt. bei stöchiometrischem Mischungsverhältnis,
dt. bei einem Luft-Kraftstoff-Verhältnis von lambda = 1
Der Terminus Stöchiometrie gehört zur Fachsprache der Chemie und wird im Deutschen in der Fachsprache der Katalysatortechnik nicht als Äquivalent für stoichiometry verwendet.
Übersetzungen auf der Basis von lambda = 1 können allerdings grammatikalische oder stilistische Probleme aufwerfen, da lambda = 1 nicht modifiziert werden kann. Adverbiale Modifikationen wie in
(1.1) * etwas/leicht lambda = 1
(1.2) * kurzzeitig lambda = 1
problems discussed in a transfer comment arise from the specific translation direction. A
transfer comment is related to a source term, but it is about the transfer step and the correct
use of the transfer equivalents. For this reason, it is often easier to explain certain transfer
problems in the target language.
In general, transfer comments tend to be quite heterogeneous, which is hardly astonishing
considering the fact that they are about terms in relation to the interacting conditions and
complex problems of translational text processing. Transfer comments draw on various
types of terminological information, especially on meaning definitions and usage, and
they often contain references to encyclopaedic units (see below) in order to direct the
user's attention to relevant subject information. In many cases, therefore, one particular
comment can assist the translator in solving different transfer problems. Because of this
varied and multi-faceted nature of transfer comments, their production requires a careful
coordination of the various types of information contained in the termbank.
13.4 Encyclopaedia
The translational relevance of the encyclopaedia derives from a close interaction between
term-oriented encyclopaedic information and other types of terminological information,
such as meaning definitions, grammatical properties, collocations, and conditions of use.
Encyclopaedic units are written with a view to the special needs of translators; they
embody terminologically relevant information in a concise and customized way. It is, in fact, the interplay of both types of information - domain-specific and language-specific - that makes the encyclopaedia a particularly useful instrument for the translator.
Unlike in a textbook, the presentation of encyclopaedic information in connection with a termbank is not a goal in itself. Rather, the information is selected, organized and presented with a view to the specific terminological problems of text comprehension and production. One major function of encyclopaedic information, in this context, is to
supplement the meaning definitions of terms by illustrating particular aspects of the subject area under consideration, thus placing terms in a wider context, without which an adequate interpretation would be difficult, or even impossible.
Translators are not confronted with terms in isolation. The terms they are dealing with
occur in texts, where they are bound together by cohesive ties on the basis of their participation in a common knowledge frame. Some of the terms which are in a frame relation to
stöchiometrisch are given below together with their meaning definitions:
stöchiometrisches Luft-Kraftstoff-Verhältnis (stoichiometric air/fuel ratio): Ein stöchiometrisches Luft-Kraftstoff-Verhältnis ist das für die Verbrennung ideale Verhältnis von Kraftstoff und zugeführter Luftmenge. Es liegt vor, wenn 1 kg Kraftstoff mit 14,7 kg Luft gemischt wird.
Luftverhältnis Lambda (air ratio of lambda): Das Luftverhältnis Lambda ist das Verhältnis zwischen der tatsächlich dem Kraftstoff zugeführten Luftmenge L und der für die vollständige Verbrennung des Kraftstoffs erforderlichen Luftmenge Lth (theoretischer Luftbedarf).
stöchiometrischer Punkt (stoichiometry): Der stöchiometrische Punkt ist erreicht, wenn für das Luftverhältnis Lambda gilt: Lambda = 1, d.h. wenn die für die vollständige Verbrennung des Kraftstoffs erforderliche Menge Luft zugeführt wird.
STÖCHIOMETRISCHES LUFT-KRAFTSTOFF-VERHÄLTNIS
[fett; ideales Mischungsverhältnis; Lambda; Lambda = 1; Lambda > 1; Lambda
The encyclopaedic unit STÖCHIOMETRISCHES LUFT-KRAFTSTOFF-VERHÄLTNIS provides the required additional information. It sheds light on the interpretation of terms by presenting the whole cluster of thematically related terms within the relevant knowledge frame.
In addition to and beyond the semantic exploitation of the factual information given, an encyclopaedic unit can be useful in that it implicitly provides terminologically relevant linguistic information about, say, grammatical properties and appropriate collocations (e.g. dem Kraftstoff Luft zuführen; den stöchiometrischen Punkt einhalten; vom stöchiometrischen Punkt abweichen; die Einhaltung des stöchiometrischen Punktes), and about the actual use experts make of terms when conveying technical knowledge.
GEMISCHREGELUNG
[abmagern; anfetten; Gemischregelung; Katalysatorfenster; Lambdafenster; Lambdaregelung; Restsauerstoffgehalt; Sauerstoffanteil; Totzeit]
Zur Einhaltung des STÖCHIOMETRISCHEN LUFT-KRAFTSTOFF-VERHÄLTNISSES findet beim Drei-Wege-Katalysator eine Gemisch- bzw. Lambdaregelung statt (REGELUNG UND STEUERUNG DER GEMISCHBILDUNG BEI OTTOMOTOREN). Mit Hilfe eines Meßfühlers, der LAMBDASONDE, wird dabei der Sauerstoffanteil im Abgas (Regelgröße) vor Eintritt in den Katalysator gemessen. Der Restsauerstoffgehalt ist in starkem Maße von der Zusammensetzung des Luft-Kraftstoff-Gemisches abhängig, das dem Motor zur Verbrennung zugeführt wird. Diese Abhängigkeit ermöglicht es, den Sauerstoffanteil im Abgas als Maß für die Luftzahl Lambda heranzuziehen. Wird nun der stöchiometrische Punkt (Lambda = 1; Führungsgröße) über- oder unterschritten, gibt die Lambdasonde ein Spannungssignal an das elektronische Steuergerät der Gemischaufbereitungsanlage. Das Steuergerät erhält ferner Informationen über den Betriebszustand des Motors sowie die Kühlwassertemperatur. Je nach Spannungslage der Lambdasonde signalisiert das Steuergerät nun seinerseits einem Gemischbildner (Einspritzanlage oder elektronisch geregelter Vergaser), ob das Gemisch angefettet oder abgemagert werden muß (vermehrte Kraftstoffeinspritzung bei Sauerstoffüberschuß, verminderte bei Sauerstoffmangel). Da vom Zeitpunkt der Bildung des Frischgemisches bis zur Erfassung des verbrannten Gemisches durch die Lambdasonde einige Zeit vergeht (Totzeit), ist eine konstante Einhaltung des exakten stöchiometrischen Gemisches nicht möglich. Die Luftzahl Lambda schwankt vielmehr in einem sehr engen Streubereich um Lambda = 1. Dieser Bereich wird als Katalysator- oder Lambdafenster bezeichnet und liegt bei einem Wert unter 1%.
In zwei Fällen wird die Gemischregelung abgeschaltet: zum einen nach
Fig. 38: Encyclopaedic entry example (2)
The encyclopaedia is constructed as a modular part of the termbank, accessible both from within terminological entries and from the outside. The information presented is broken down into encyclopaedic units of manageable size describing a particular aspect of a given domain; compare the encyclopaedic unit GEMISCHREGELUNG.
Each encyclopaedic unit consists of a well-motivated encyclopaedic header (or title), an alphabetical list of encyclopaedic terms in square brackets for whose contextual understanding it is relevant, and a free text. The link-up between terminological entries and the encyclopaedia is established by means of a many-to-many relation between terms and headers; that is, one unit refers to several terms, and the same encyclopaedic terms can be covered by more than one unit. Characteristically, the encyclopaedia provides information only where information is needed. That is, it neither caters for all the terms in the termbank, nor does it cover every single aspect of the subject area under consideration. For this reason, links to the encyclopaedia are only established for terms for whose translational processing the intended user might need additional encyclopaedic information.
Encyclopaedic units need to be organised within larger knowledge structures. A structure suggesting itself from a traditional point of view of classification is a hierarchical one. But such an approach is faced with a serious problem. Depending on the angle from which a subject area is looked at, it presents itself with a different structural organisation. When viewed from one perspective, a particular unit may seem to be subordinate to others, and superordinate when looked at from a different point of view. What at one time seems to be closely related can at others be wide apart. In this sense, any subject area is multidimensional, and this should be reflected by its encyclopaedic structure. A rigid hierarchical structure does not meet this requirement.
The links between thematically related encyclopaedic units, established by means of their headers (in capital letters), provide the basis for an alternative approach. Starting from any unit, the user is able to access all other units, or a selection of them, whose headers occur within this unit either contextually or as explicit references, e.g. the headers STÖCHIOMETRISCHES LUFT-KRAFTSTOFF-VERHÄLTNIS, LAMBDASONDE and REGELUNG UND STEUERUNG DER GEMISCHBILDUNG BEI OTTOMOTOREN in the encyclopaedic unit GEMISCHREGELUNG. Exploiting these header links, the user can thus move along an individual path to create an encyclopaedic grouping of units reflecting the specific perspective from which the encyclopaedia is accessed, and providing an individual answer to individual information needs.
With the initial unit GEMISCHREGELUNG as its focal point, for instance, the ensuing encyclopaedic grouping spreads out to embrace more and more units, containing general or specific information, as chosen and specified by the user through the headers. In this way subordinate, superordinate, and coordinate units are grouped together, forming a tailor-made overview on individually selected aspects of the issue in question.
An encyclopaedic grouping represents a dynamic structure containing the encyclopaedic information which is of relevance in the current retrieval situation. Starting from a given term and searching for individually needed subject information, an ad hoc organisation of the relevant units is created via the flexible interplay of encyclopaedic terms, headers, and units. Such a dynamic structuring of encyclopaedic units through freely generated groupings is a reflection of the multi-faceted and multi-dimensional thematic make-up of a subject area.
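The header links and the many-to-many relation between terms and units can be modelled as a small graph. The sketch below, with unit contents abbreviated from the examples above, collects an encyclopaedic grouping by following header references from a chosen starting unit up to a user-selected depth.

# Each encyclopaedic unit lists the headers of other units it refers to.
units = {
    "GEMISCHREGELUNG": {
        "refers_to": ["STÖCHIOMETRISCHES LUFT-KRAFTSTOFF-VERHÄLTNIS",
                      "LAMBDASONDE",
                      "REGELUNG UND STEUERUNG DER GEMISCHBILDUNG BEI OTTOMOTOREN"],
        "terms": ["abmagern", "anfetten", "Lambdafenster"],
    },
    "STÖCHIOMETRISCHES LUFT-KRAFTSTOFF-VERHÄLTNIS": {
        "refers_to": ["GEMISCHREGELUNG"],
        "terms": ["Lambda", "fett"],
    },
    "LAMBDASONDE": {"refers_to": [], "terms": ["Restsauerstoffgehalt"]},
    "REGELUNG UND STEUERUNG DER GEMISCHBILDUNG BEI OTTOMOTOREN":
        {"refers_to": [], "terms": ["Regelung", "Steuerung"]},
}

def grouping(start, depth):
    """Collect the encyclopaedic grouping reachable from a starting unit
    by following header links up to a user-chosen depth."""
    seen, frontier = {start}, [start]
    for _ in range(depth):
        frontier = [ref for u in frontier for ref in units[u]["refers_to"]
                    if ref not in seen]
        seen.update(frontier)
    return seen

print(grouping("GEMISCHREGELUNG", depth=1))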
Checking natural language for errors can be subdivided into several levels of complexity. A well-known kind of checking is conventional word-based spelling checking. Nearly every text processing system has an integrated spelling checker (differing, however, in quality, especially where languages other than English, French, and German are concerned). Only a large dictionary is needed against which the text can be matched. But these checkers cannot find all spelling errors, and they cannot find grammatical or stylistic errors at all.
Some progress has been made in the last decade to fill this gap: special dictionaries have been developed to resolve misleading spellings, statistical algorithms have been used to give the author information about word and sentence length and readability scores, new algorithms have been found to check sophisticated mistakes with a minimum of effort, and, last but not least, parsers have been developed to check grammatical mistakes which could not be checked before.
In order to deal with the different kinds of errors found in texts, several layers of proofreading tools are used in a cascade in the TWB project:
word-based spell checking in languages where no spell checker of acceptable quality is available (-> Greek and Spanish);
extended spell checking, i.e. context-sensitive checking as an intermediate between word-based and grammar-based checking (German, and to some extent English);
simple grammar checker for detecting errors in noun phrase agreement;
elaborated grammar and style checker for preparation of documents for automatic machine translation.
The following chapters deal with these aspects of proofreading and give some insight into
language-specific problems.
15.
delete grapheme
insert grapheme
exchange graphemes
reorder graphemes
Each correction strategy generates a certain syllable type. Given the fact that each pool is consistent with its syllable typology, each newly generated candidate must belong to a
restricted subset of word pools. From each word pool there is a link to the corresponding
word pools related to the four possible corrections for the first syllable; then the correction
strategy is applied to the second word syllable and so forth.
The system generates only syllables that are permitted in the language for a given position.
Syllables consist of the possible combination of:
initial cluster (present or not)
vocalic core
coda (present or not)
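A much simplified sketch of this syllable-oriented correction idea is given below: split a misspelled word into rough syllable-like chunks, apply the four grapheme operations to one syllable at a time, and keep only candidates found in the word list. The syllable split and the tiny word list are toy assumptions, not the actual TWB syllable inventory or word pools.

import re

WORDS = {"casa", "cosa", "carta", "carro"}   # toy lexicon (stand-in for the word pools)

def syllables(word):
    """Very rough syllable split: consonant cluster + vowel core + optional coda."""
    return re.findall(r"[^aeiou]*[aeiou]+[^aeiou]?", word) or [word]

def edits(chunk):
    """The four correction strategies applied to one syllable:
    delete, insert, exchange and reorder graphemes."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    out = set()
    for i in range(len(chunk)):
        out.add(chunk[:i] + chunk[i + 1:])                          # delete
        out.update(chunk[:i] + c + chunk[i:] for c in letters)      # insert
        out.update(chunk[:i] + c + chunk[i + 1:] for c in letters)  # exchange
        if i + 1 < len(chunk):                                      # reorder (transpose)
            out.add(chunk[:i] + chunk[i + 1] + chunk[i] + chunk[i + 2:])
    return out

def candidates(word):
    sylls = syllables(word)
    cands = set()
    for i, s in enumerate(sylls):
        for e in edits(s):
            cands.add("".join(sylls[:i] + [e] + sylls[i + 1:]))
    return cands & WORDS

print(candidates("casaa"))   # e.g. {'casa'}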
Figure 39 shows the architecture of the Spanish Speller.
[Fig. 39: architecture of the Spanish Speller - lists of initial clusters, core (vowel) clusters and final clusters; look-up in the 17 syllable structures; correction strategy]
Spelling mistakes made by native speakers are mainly typing errors. This can be attributed
to the simple syllable structure of the language and to the fact that Spanish is spelt phonologically. With the exception of some "b/v" and "-/h" orthography cases, misspelling
errors depend on typing skills and mechanical factors: key disposition (adjacent keys) and
simultaneous keystrokes. The most common mistake is adding a letter by inadvertently
pressing the adjacent keys.
The TWB Spanish Spell Checker, when measured against other spellers (Proximity, Word, WordPerfect) on the same documents, offers a higher correction accuracy. Accuracy is defined in terms of the average number of word candidates offered for misspellings.
Commercial spell checkers presuppose correctness in the first graphemes of a word; whenever this is not the case, the list of correction candidates may grow considerably and may not even contain the right correction.
Therefore, not surprisingly, better results are obtained with a spell corrector that supports
the phonological structure of a language. Given the fact that Spanish has phonologically
oriented graphemics and an easy-to-handle syllable structure, spelling correction can be
done with a syllabic approach. Moreover, given the nature of the data (the syllable inventory) and the algorithm, there are spin-off applications of this approach:
hyphenation
speech recognition support
OCR recognition support
The present implementation runs on a SUN Sparc; it works with 220,000 word forms organised into 5,712 pools according to their syllable typology; the syllabic discrimination is made upon 17 different syllable types.
15.2.1 Motivation
As indicated by Maurice Gross in his COLING 86 lecture (Gross 1986), European languages contain thousands of what he calls "frozen" or "compound words". In contrast to
"free forms", frozen words - though being separable into several words and suffixes -lack
syntactic and/or semantic compositionality. This "lack of compositionality is apparent
from lexical restrictions" (at night, but: *at day, *at evening, etc.) as well as "by the impossibility of inserting material that is a priori plausible" (*at {coming, present, cold, dark}
night) (Gross 1986).
Since the degree of 'frozenness' can vary, the procedure for recognizing compound words
within a text can be more or less complicated. Yet, at least for the completely and almost
completely frozen forms, simple string matching operations will suffice (Gross 1986).
However, although this clearly indicates that at least those compound words whose parts
have a high degree of 'frozenness' are accessible to the methods of standard spelling correction systems, it is true that these systems try at best to cope with (some) compound
nouns while they are still ignorant of the bulk of other compound forms and of violations
of lexical and/or co-occurrence restrictions in general.
As Zimmermann (1987) points out with respect to German forms like "in bezug auf" (= frozen) versus "mit Bezug auf" (= free), compounds are clearly outside the scope of
standard spelling correction systems due to the fact that these systems only check for isolated words and disregard the respective contexts.
Following Gross (1986) and Zimmermann (1987), we propose to further extend standard
spelling correction systems onto the level of compound words by making them context-sensitive as well as capable of treating more than a single word at a time.
Yet even on the level of single words many more errors could be detected by a spelling
corrector if it possessed at least some rudimentary linguistic knowledge. In the case of a
word that takes irregular forms (like the German verb "laufen" or the English noun
"mouse", for example), a standard system seems to "know" the word and its forms for it is
able to verify them, e.g. by simple lexicon lookup. Yet when confronted with a regular though false form of the very same word (e.g. with "laufte" as the 1st/3rd pers. sg. simple past ind. act., or with the plural "mouses"), such a system normally fails to propose the corresponding irregular form ("lief" or "mice") as a correction.
Following a suggestion in Zimmermann (1987), we propose to enhance standard spelling
correction systems on the level of isolated words by introducing an additional type of lexicon entry that explicitly records those cognitive errors that are intuitively likely to occur
(at least in the writings of non-native speakers) but which a standard system fails to treat
in an adequate way for system intrinsic reasons.
15.2.2 Overview of new Phenomena for Spelling Correction
As there are irregular forms which are nevertheless well-formed, i.e.: words, there are also
regular forms which are ill-formed, i.e.: non-words. Whereas words are usually known to
a spelling correction system, we have to add the non-words to its vocabulary in order to
improve the quality of its corrections.
On the level of single words in German, non-words come from various sources and comprise, among others, false feminine derivations of certain masculine nouns (*Wandererin, *Abtin), false plurals of nouns (*Thematas, *Tertias), non-licensed inflections (*beigem, *lila(n)es) or comparisons (*lokaler, *minimalst) of certain adjectives, false comparisons (*nahste, *rentabelerer), wrong names for the citizens of towns (*Steinhagener, *Stadthäger), etc. Some out-dated forms (e.g. Preißelbeere, verkäufst, abergläubig) can likewise be treated as non-words.
It is on the level of compounds that words rather than non-words come into consideration
again, when we look for contextual constraints or co-occurrence restrictions that determine orthography beyond the scope of what can be accepted or rejected on the basis of
isolated words alone.
For words in German, these restrictions determine, among other things, whether or not
certain forms (1) begin with an upper or lower case letter; (2) have to be separated by (2.1)
blank, (2.2) hyphen, (2.3) or not at all; (3) combine with certain other forms; or even (4)
influence punctuation. Examples are:
1) Ich laufe eis. versus Ich laufe auf dem Eis.
2.2) Er liebt Ich-Romane. versus 2.3) Er liebt Romane in Ichform.
15.2.3 Method
The extensions proposed in (1) above are conservative, in the sense that their realization simply requires widening the scope of the string matching/comparing operations that are used classically in spelling correction systems. No deep and time-consuming analysis, like parsing, is involved.
Restricting the system in this way makes our approach to context-sensitivity different from the one considered in Rimon/Herz (1991), where context-sensitive spelling verification is proposed to be done with the help of "local constraints automata (LCAs)" which process contextual constraints on the level of lexical or syntactic categories rather than on the basic level of strings. In fact, proof-reading with LCAs amounts to genuine grammar checking and as such belongs to a different and higher level of language checking than the extensions of pure spelling correction proposed here.
Now, in order to treat these extensions in a uniform way, each entry in the system lexicon is modelled as a quintuple <W,L,R,C,E> specifying a pattern of a (multi-)word W for which a correction C will be proposed, accompanied by an explanation E, just in case a given match of W against some passage in the text under scrutiny differs significantly from C and the - possibly empty - left and right contexts L and R of W also match the environment of W's counterpart in the text.
Disregarding E for a moment, this is tantamount to saying that each such record is interpreted as a string rewriting rule
W --> C / L_R
replacing W (e.g. Bezug) by C (e.g. bezug) in the environment L_R (e.g. in_auf).
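Such quintuples can be applied with plain string pattern matching, one sentence at a time. The rule format, example rules and explanation texts below are illustrative assumptions in the spirit of the <W,L,R,C,E> entries, not the actual TWB lexicon format.

import re

# Each rule: (left context L, wrong form W, right context R, correction C, explanation E)
RULES = [
    ("in", "Bezug", "auf", "bezug",
     "In der festen Wendung 'in bezug auf' wird 'bezug' klein geschrieben."),
    ("", "laufte", "", "lief",
     "'laufen' bildet das Präteritum unregelmäßig: 'lief'."),
]

def check_sentence(sentence):
    """Return (wrong form, correction, explanation) triples found in one sentence."""
    findings = []
    for left, wrong, right, corr, expl in RULES:
        pattern = r"\b" + r"\s+".join(re.escape(t) for t in (left, wrong, right) if t) + r"\b"
        if re.search(pattern, sentence):
            findings.append((wrong, corr, expl))
    return findings

print(check_sentence("Wir schreiben Ihnen in Bezug auf Ihre Anfrage."))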
The form of these productions can best be characterized, with an eye to the Chomsky hierarchy, as unrestricted, since we can have any non-null number of symbols on the LHS replaced by any number of symbols on the RHS, possibly by null (Partee 1990).
With an eye to semi-Thue or extended axiomatic systems, one could say that a linearly ordered sequence of strings W, C1, C2, ..., Cm is a derivation of Cm iff (1) W is a (faulty) string (in the text to be corrected) and (2) each Ci follows from the immediately preceding string by one of the productions listed in the lexicon (Partee 1990).
Thus, theoretically, a single mistake can be corrected by applying a whole sequence of productions, though in practice the default is clearly that a correction be done in a single derivational step, at least as long as the system is just operating on strings and not on additional non-terminal symbols.
Occurrences of W, L, and R in a text are recognized by pattern matching techniques. Since
the patterns for contexts allow L and R, in principle, to match the nearest token to be
found within an arbitrary distance from W, we have to restrict the concept of a context in a
natural way in order to prevent L and R from matching at non-significant places. Thus, by
having the system operate sentencewise, any left or right context is naturally restricted to
some string within the same sentence as W or to a boundary of that sentence (e.g.: a punctuation mark).
In case a correction C is proposed to the user, an additional message will be displayed to
him or her, identifying the reason why C is correct rather than W. Depending on the user's
knowledge of the language under investigation, he or she can take this either as an opportunity to learn or as a guide for deciding whether to finally accept or reject the proposal.
text - disturb each other's results by proposing antagonistic corrections with respect to one and the same expression: within the correct passage "in bezug auf", for example, "bezug" will first be regarded as an error by the standard checker, which will then propose to rewrite it as "Bezug". If the user accepts this proposal, he will receive exactly the opposite advice from the context-sensitive checker.
On the other hand, checking on different levels could go hand in hand nicely and produce synergetic effects: for, clearly, any context-sensitive checking requires that the contexts themselves be correct and thus possibly have been corrected in a previous, possibly context-free, step. The checking of a single word could in turn profit from contextual knowledge in narrowing down the number of correction alternatives to be proposed for a given error: while there may be some eight or nine plausible candidates as corrections of "Bezug" when regarded in isolation, only one candidate, i.e. "bezug", is left when the context "in_auf" is taken into account.
Thus, there is a strong demand for arriving at a holistic solution for multi-level language
checking rather than for just having various level experts hooked together in series. This
will be the task for the near future.
15.2.5 Portability
The software of the system described is modular in the sense that it can be integrated into
any word processing software. We have already ported the German prototype from the TA
SWP word processing program into Microsoft's WinWord 1.1, for example.
As concerns the lingware, we take it that a similar approach is also feasible for languages
other than German. Although in comparison with English, French, Italian, and Spanish,
German seems to be unique as regards the relevance of the context for upper/lower case
spellings in a large number of cases, there are at least, as indicated in (Gross 1986), the
thousands of compounds or frozen words in each of these languages which are clearly
within reach for the methods discussed.
Grammar checkers today cover corrections that deal with misspellings that can only be captured within the sentence context. Some can deal with complex grammar errors that concern verbal arguments or normative cases set down in academic books.
Spell checking tools differ greatly from language to language. Here we could compare the approach of the "Extended TWB Speller for German" (borrowed from the Duden norm) with that of the commercial "Grammatik II".
Different users, according to their profile, make different mistakes. For instance, Spanish
native speakers almost never make agreement mistakes; even non-native speakers do not
have many problems with this. When writing technical documents, the most usual mistakes are wrong tenses, wrong appositions and problems with reflexive forms or with marked
prepositions. All these cases can only be captured by means of parsing and with a robust
lexical grammar.
In a wider approach, to be presented in Sect. 16.3 on Verification or Controlled Grammars, the effort is concentrated not only on getting rid of wrong grammatical sequences, but rather on presenting an integrated framework for controlling the user's language and thus aiding a full grammar and style analysis. In this respect, the first two approaches give a more limited, agreement-based view of grammar checking, which is comparatively faster and has been shown to be portable to PCs, whereas the full-fledged style checking provides the in-depth analysis. Thus the two approaches are complementary, depending on
the task and the time available.
editors, has been under development at AT&T since 1979. Target machines are computers
with a UNIX operating system. Beyond well known facilities such as spell checking, Writer's Workbench does style and grammar critiquing, though it cannot do critiquing that
requires a parser output. Thus, its style critiquing is restricted to phenomena which can be
analysed with the help of small dictionaries, simple patterns of phrases and statistical
methods. Errors which can be checked by software using this approach are errors like split
infinitives, wordy phrases, wrong word use, jargon, gender-specific terms, etc. The user
can be provided with information about frequency of passive voice, wordiness, frequency
of abstract words and often readability scores.
EPISTLE (Heidorn 1982; Jensen et al. 1986) and its successor CRITIQUE have been
developed by IBM as a mainframe text-proofing system. They are based on concepts and
features of Writer's Workbench. Unlike those of Writer's Workbench, the checking tools
of EPISTLE use an integrated parser. This allows EPISTLE to cover a wider range of
grammatical and stylistic errors. In addition to the features described in Writer's Workbench above, EPISTLE is able to check grammatical phenomena like errors in agreement
(between subject and verb, noun and premodifier), improper form of the infinitive, etc.
What about commercial successors of EPISTLE? There seems to be one, looking at
Microsoft's Word for Windows. The Beta-Version of Word for Windows 2.0 includes a
grammar checking facility covering most of the features of EPISTLE - a very remarkable
fact. This would be the first time that an enhanced grammar checking utility including a
parser is integrated in a very common text processing system thus reaching a larger
number of people.
AT&T, mM and Microsoft as pioneers of grammar checking - this looks like a snapshot of
a book "who's who in the American computer industry". Are there any efforts to do grammar checking for other European languages - others than English? Aren't there any nonAmerican, say European efforts in this area?
To answer the first question: It must be admitted that most commercial software is
restricted to English. Concerning the second question: There are some efforts - namely the
Chandioux group released in 1990 GramR Le Detecteur, a DOS based grammar checker
for French. But this is too little compared with the efforts spent by American researchers.
Checking tools in TWB
Against this background, checking tools for European languages are very important. Partners from several European countries (Greece, Spain, Germany, Great Britain) - and thus the corresponding linguistic knowledge - are concentrated in the Translator's Workbench project.
With respect to German, we can summarize that although simple spelling checkers for German are already integrated in standard text processing systems, there is a lack of tools for checking more complex errors. While the context-sensitive spell checker has been described in the previous section, this section will focus on the grammar checker developed
at TA Triumph-Adler AG.
Empirical Study
Is there really a need in German for such a sophisticated tool as a grammar checker? Is the percentage of grammatical mistakes significant compared to the total of mistakes occurring in German texts? To answer this question, a study was carried out by the University of Heidelberg, Department of Computational Linguistics (see Hellwig/Bub/Romke 1990). The corpus includes 1000 errors, and the most frequent errors can be classified into errors of
Agreement
Choice of Words
Prepositional Phrase
Syntax of Determiner
As shown in Fig. 40, orthographic errors are the most frequent (25.5% in total), but errors of agreement also occur very often. In addition to errors of agreement within noun phrases (18%), there are errors of agreement between subject and verb (3.5%).
Agreement within simple noun phrases such as (a) and (b), which consist of terminal symbols only, can be checked without recursion:
(a) DET N
(b) DET ADJ N
More complex structures such as (c) and (d) have the problem that they consist not only of terminal symbols (DET, ADJ, N) but also of nonterminals (NP, PP, relative clause, ...), i.e. a recursive algorithm is needed to parse the structure correctly.
(c) Die innovative von TA eingeführte Neuerung (...)
(d) DET N Rel.-Clause
Checking correct agreement between subject and verb leads to another problem. Subject and verb agree in number and person. There are no problems if the subject consists of a single noun as in (e), nor if the subject consists of coordinated plural nouns as in (f). Difficult are the cases (g) and (h), where two singular nouns are coordinated, resulting in a plural noun phrase.
(e) Ich bin (...)
    NP(1.person) V(1.person)
(f) Die Drucker und die Laufwerke sind (...)
    NP(plural) KOORD NP(plural) V(plural)
(g) Der Drucker und das Laufwerk sind (...)
    NP(singular) KOORD NP(singular) V(plural)
(h) Ich und Du sind (...)
    NP(1.pers/sg) KOORD NP(2.pers/sg) V(1.pers/pl)
Looking at these examples, we see that recursion and a certain algorithmic complexity are necessary to parse the structure of NPs and to check agreement. But how can we avoid this expensive recursion (expensive in terms of both time and space), and where should the checks for cases like (g) and (h) or for complex subclauses within NPs be carried out? We parse on two levels: in the first step we build
a simple phraselist from the tokenlist which includes the lexical information, i.e. we build
simple phrases (coordinated ADJs, PPs, simple NPs). In a second step we build complex
phrases (coordinated NPs, coordinated PPs, participles within NPs, and so on). Another advantage of the division between the simple phrase list and the complex phrase list is that we can avoid recursion.
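To make the two-level strategy concrete, the following sketch (in Python, with hypothetical token and phrase structures; it is an illustration, not the TA Triumph-Adler implementation) builds simple noun phrases from a token list, coordinates them on the second level, and then checks subject-verb agreement for a case like (g):

from dataclasses import dataclass

@dataclass
class Token:
    word: str
    cat: str              # DET, ADJ, N, KOORD, V
    number: str = ""      # "sg" or "pl"

@dataclass
class Phrase:
    cat: str              # "NP", "KOORD", "V", ...
    tokens: list
    number: str = ""

def build_simple_phrases(tokens):
    # Level 1: group DET (ADJ)* N sequences into simple NPs; no recursion needed.
    phrases, i = [], 0
    while i < len(tokens):
        if tokens[i].cat == "DET":
            j = i + 1
            while j < len(tokens) and tokens[j].cat == "ADJ":
                j += 1
            if j < len(tokens) and tokens[j].cat == "N":
                phrases.append(Phrase("NP", tokens[i:j + 1], number=tokens[j].number))
                i = j + 1
                continue
        phrases.append(Phrase(tokens[i].cat, [tokens[i]], number=tokens[i].number))
        i += 1
    return phrases

def build_complex_phrases(phrases):
    # Level 2: two coordinated NPs form a plural NP, as in example (g).
    out, i = [], 0
    while i < len(phrases):
        if (phrases[i].cat == "NP" and i + 2 < len(phrases)
                and phrases[i + 1].cat == "KOORD" and phrases[i + 2].cat == "NP"):
            toks = phrases[i].tokens + phrases[i + 1].tokens + phrases[i + 2].tokens
            out.append(Phrase("NP", toks, number="pl"))
            i += 3
        else:
            out.append(phrases[i])
            i += 1
    return out

def check_subject_verb(phrases):
    # Flag a number disagreement between an NP and the verb that follows it.
    for k, p in enumerate(phrases):
        if p.cat == "NP" and k + 1 < len(phrases) and phrases[k + 1].cat == "V":
            if p.number and phrases[k + 1].number and p.number != phrases[k + 1].number:
                return "agreement error: subject %s, verb %s" % (p.number, phrases[k + 1].number)
    return "no agreement error"

# Example (g): "Der Drucker und das Laufwerk sind ..."
tokens = [Token("Der", "DET", "sg"), Token("Drucker", "N", "sg"), Token("und", "KOORD"),
          Token("das", "DET", "sg"), Token("Laufwerk", "N", "sg"), Token("sind", "V", "pl")]
print(check_subject_verb(build_complex_phrases(build_simple_phrases(tokens))))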
(Figure: core grammar rules and peripheral rules.)
The Spanish Grammar Checker, as mentioned before, only covers nominal constructions (with adjectival phrases, participles and all sorts of appositions, e.g. acronyms, abbreviations, etc.). We do not cover relative clauses.
The model behind the implementation is very simple. It can be applied to any grammar model; it simply adds the possibility of allowing additional "paths" as well-formed strings in the agenda during parsing. The system applies unification gradually but on different feature sets, each time with a smaller set of restrictions. The rule triggered (a phrase structure rule, PSR) is the same, but with a wider scope, so that it can accept incorrect input.
This mechanism is contained in each rule. However, if the Grammar Checker were implemented table-driven, users could select options that would internally trigger the application of different feature unifications. In that way users could tune the scope of the grammar checker according to user type (foreigner/native) or document type (technical/administrative).
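As an illustration of this relaxation idea (a sketch of my own, not the METAL rule formalism; the feature names and examples are assumptions), the same determiner-noun rule can be tried first with the full agreement requirements and then with progressively weaker ones, so that ill-formed input is still accepted but labelled with a diagnosis:

# Feature sets tried in order, from strictest to most relaxed.
LEVELS = [
    (("gender", "number"), "well-formed noun phrase"),
    (("number",),          "gender agreement error"),
    ((),                   "gender and number agreement error"),
]

def check_np(det, noun):
    # The "rule" is always the same; only the feature set unified over shrinks.
    for features, verdict in LEVELS:
        if all(det[f] == noun[f] for f in features):
            return verdict

la     = {"gender": "fem",  "number": "sg"}   # "la"
las    = {"gender": "fem",  "number": "pl"}   # "las"
casa   = {"gender": "fem",  "number": "sg"}   # "casa"
libros = {"gender": "masc", "number": "pl"}   # "libros"

print(check_np(la, casa))     # well-formed noun phrase
print(check_np(las, libros))  # gender agreement error
print(check_np(la, libros))   # gender and number agreement error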
Both Spanish checkers (Grammar and Speller) run on SUN SPARC workstations and run separately. The Grammar Checker uses the METAL Spanish dictionary (20,000 entries) and 40 analysis rules (for nominals and restricted sentence constructions).
16.2.1 Coverage
The most common grammar errors are misspellings that concern the gender and number agreement of words. However, most of these errors can be caught via spelling correction. The grammar checker mainly deals with the agreement of determiners or adjectives, because in the Romance languages there must be agreement for both features. Moreover, native speakers will never make such mistakes, because they have a clear competence in assigning the correct gender and number of words, and the keyboard layout makes it very difficult to type the key for "o" instead of that for "a".
Within TWB we developed a reduced grammar for nominals that handles two types of phenomena: easy copula sentences with predicate agreement, and nominal appositions. Appositions were divided into narrow appositions (defining modifiers) and wide appositions (non-defining modifiers).
16.3 The Controlled Grammar Verifier
This TWB component is the result of trying to optimise the documentation process. Considerations of how to improve the input of machine translation were compared with guidelines for technical authors, and large overlaps were detected. This resulted in a common effort to improve both the readability and the translatability of texts by setting up styleguides for authors. Since these styleguides define a sublanguage of their own, in that they restrict the grammar of a language, they are called controlled languages.
There are several reasons for setting up controlled grammars:
corporate identity requires the use of certain terms and expressions instead of others;
ease of readability and understanding, also for non-native speakers, requires very limited grammar usage (e.g. in the case of AECMA);
ease of translatability also restricts the language (e.g. in Xerox' adaptations for SYSTRAN translation).
Our task was to implement software which compares texts with those guidelines and flags
the deviations. This is the content of the Controlled Grammar Verifier in TWB.
16.3.1 Architecture
There are two possible architectures for a Controlled Grammar Verifier. Either only the subset described by the controlled grammar is implemented, and anything that cannot be parsed is considered to be ill-formed; or a full parser is implemented and deviations are flagged. The former approach is easier to implement but has some drawbacks:
It is true that deviations lead to parse failures. But no other parse failures must occur, otherwise a parse failure is no longer meaningful. This cannot be guaranteed, however.
No error diagnosis can be given, as a parse failure gives no hint where or why the parser failed. This is not considered to be user-friendly.
We therefore decided to implement the latter strategy in TWB. Here, the overall architecture of the system looks as described in Fig. 42.
In this approach, a Controlled Grammar Verifier basically consists of four components:
an input sentence analyser which produces linguistic structures of the sentences to be
checked;
a collection of linguistic structures which are considered to be ill-formed (according to
some criteria);
a matcher which matches the input structures with the potential ill-formed structures
and flags the deviations;
an output module which produces useful diagnostic information. We do not, however, intend to produce automatic corrections; this is too difficult at present.
For a number of reasons, described in Thurmair 1990, we chose the METAL analysis
components.
However, as a Controlled Grammar Verifier deals with a subset of the grammar, it should be implemented such that the grammar is not touched at all. The only thing needed should be a description of the trees the grammar produces; then any parser and grammar can be used as long as it produces the kind of trees specified.
In TWB, this was the guideline for the implementation. Nothing was changed inside the METAL analysis components to perform the diagnosis.
The second component to be considered was the rules of the controlled grammars; they
mainly consist of things to be avoided (too long sentences, too complex compounds, too
many passive constructions, ambiguous prepositional phrases, etc.). The phenomena were
collected from an examination of the relevant literature (cf. Schmitt 1989).
The first task here is to reformulate the statements of the controlled grammars in terms of
linguistic structures and features: Which structures should be flagged when a sentence
should not be "too complex"? Which structures indicate an "ambiguous prepositional
phrase attachment"? The result of this step was a list of structures, annotated with features,
the occurrence of which indicated an ill-formed construction.
The next step was to find a representation for these structures. It had to be as declarative as
possible, which led to two requirements:
the structures should be stored in files, not in programs, not only in order to ease testing, but also to change applications (and languages) later on (e.g. from German Siemens Nixdorf styleguides to English AECMA controlled languages);
the structures should be declared in some simple language, describing precedence and dominance of nodes, and the presence and absence of features, feature combinations and values. Any linguist should be able to implement their own sets of controlled grammar
phenomena and call them by just specifying their specific diagnostic file.
Both requirements were fulfilled in the final TWB demonstrator; the ill-formed structures
are collected in a file which is interpreted at runtime; and the structures are described in a
uniform and easy way (cf. Thurmair 1990).
16.3.4 The Matcher
The matcher is the central component of the verification software. It matches the structures of the ill-formed structure repository against the input sentence, applying the feature and tree structure tests to the input tree. This has to be done recursively for all subtrees of a given syntax tree (as there may be diagnosis information on all levels of a tree). For every positive match, the matcher puts a feature onto the root node of the input tree, the value of which indicates the kind of ill-formedness and gives a hint for the production of the diagnostic information.
As a software basis for the matcher, we were able to use a component of the METAL software which performs tree operations. The output of the matching process is the input analysis tree, modified by some features if ill-formed structures were found.
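A minimal sketch of such a matcher (hypothetical node and pattern formats; the real repository uses the METAL tree structures and the declarative description language mentioned above) might look as follows: every subtree is visited, every pattern is tested, and positive matches are recorded as a diagnosis feature on the root node.

from dataclasses import dataclass, field

@dataclass
class Node:
    cat: str
    features: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

# Declarative patterns, as they might be read from a specification file:
# a category plus a feature test, each tagged with a diagnosis value.
ILL_FORMED = [
    {"diagnosis": "too_complex_compound", "cat": "NOUN",
     "test": lambda n: n.features.get("compound_parts", 1) > 3},
    {"diagnosis": "passive_construction", "cat": "CLS",
     "test": lambda n: n.features.get("voice") == "passive"},
]

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def match(tree, patterns):
    # Test every subtree recursively; record all diagnoses on the root node.
    hits = [p["diagnosis"] for node in walk(tree) for p in patterns
            if node.cat == p["cat"] and p["test"](node)]
    tree.features.setdefault("DIAGNOSIS", []).extend(hits)
    return tree

sentence = Node("CLS", {"voice": "passive"}, [
    Node("NOUN", {"compound_parts": 4}),   # e.g. an over-long German compound
    Node("NOUN", {"compound_parts": 1}),
])
print(match(sentence, ILL_FORMED).features["DIAGNOSIS"])
# ['passive_construction', 'too_complex_compound']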
A closer look at three phenomena showed encouraging results. In the case of complex prenominal modifications, all cases (13 overall) had been identified and flagged correctly. In the case of unclear PP attachment, all cases (23 overall) had been found if the sentences could be parsed; ambiguities not identified (11 overall) were due to parse failures. In the case of wrong specifier formations, all cases (8 overall) except one had been found. These results show that on the basis of a good large-coverage analysis, precise and helpful diagnosis is possible.
What turned out to be a problem is the treatment of parse failures (about 25% of the texts could not be parsed). In this case, the diagnosis can be erroneous (e.g. if the system flags an incomplete sentence structure which is due to the fact that the parser could not find all predicate parts). This requires improvements both in the parsing and analysis phase (the verifier must know if a sentence could not be parsed) and in the diagnosis phase (which must be more robust here).
16.3.7 Next Tasks
In order to make the Controlled Grammar Verifier really productive, the following tasks
must be performed:
Port the Controlled Grammar Guidelines to other applications. This should demonstrate whether the modular architecture, which is based on specification files, really works. This task has begun with an external pilot partner.
Tune the system to find out whether what it flags is what human users would flag as well. For
example: Are all constructions marked as "too complex" by the system also considered
to be complex by human readers? This also relates to the number of flags to be allowed
for the construction to be acceptable.
Improve the quality of the output component. We must be able to refer to the original
text in giving diagnostics. For example, if a sentence contains three large compounds,
we must tell the users which of them the message "unclear compound structure" refers
to. We also need good text related scores (e.g. for readability).
Finally, we need a better user interface which allows for the selection of some parameters (do not always check everything) and other more sophisticated operations.
17.1 Introduction
A correct translation includes the correct construction of phrases. Obviously, the knowledge involved in this task is very complex. A great deal of the effort in learning a language must be devoted to syntactic properties like admissible complements of words, word order, inflection and agreement. The opportunities for making mistakes are as numerous as the appropriate syntactic constructions. Therefore a tool for assessing the syntactic correctness of a translation is a very desirable module of the Translator's Workbench. Such a module is under development at the Institute for Computational Linguistics of the University of Heidelberg.1 The present work relies on basic research conducted in the framework of the PLAIN system (Programs for Language Analysis and INference) starting in the mid-1970s.2 The linguistic theory underlying the application is Dependency Unification Grammar (DUG).3
There are three levels of syntactic support for text processing systems which are characterized by increasing complexity. The first level supplies mere recognition of ill-formed
phrases. Any parser should master this level, since a parser must, by definition, accept
well-formed phrases and reject ill-formed ones.
The second level of support consists in flagging the portions of the phrase which are incorrect. However, this goal is not easy to achieve. Even correct sentences contain a lot of
local syntactic ambiguities and, hence, a lot of dead-ends, which are removed only when
the final stage of a correct parse is reached. If the latter situation does not occur because
the input is invalid, "normal" dead-ends of the analysis and the incorrect portions of the
phrase which are responsible for the parsing failure are hard to distinguish.
The most comfortable level of support is the automatic creation of the correct phrase.
While the first two tasks belong to language analysis, the correction of an ill-formed
phrase is an instance of language synthesis. To our knowledge, linguistic theory has not
expended much effort on clarifying the mechanisms of error correction and has not yet
elaborated a general and uniform solution to this problem. In any case, it is obvious that
syntax checking up to the third level ranks among the greatest challenges of computational
linguistics.
Error correction is a problem which is in conflict with the set-theoretic foundation of the theory of formal languages which, in turn, is the basis of natural language processing. A language L is formally defined as the set of well-formed strings which are generated by a grammar G.
1. The following personnel have contributed to the project: Bernhard Adam, Christoph BUisi, Jutta
Bopp, Vera Bub, Karl Bund, Ingrid Daar, Erika de Lima, Monika Finck, Marc-Andre Funk, Peter
Hellwig, Valerie Herbst, Christina KUine, Heinz-Detlev Koch, Harald Lungen, Imme Rathert,
Corinna Romke, Ingo Schiltze, Henriette Visser, Wolfgang Wander, Christiane Weispfennig.
2. See Hellwig 1980.
3. See Hellwig 1986.
doing" them. At the first view, this explanatory approach to ill-formed phrases is psychologically appealing. lll-formedness is rule-based, as Weischedel and Sondheimer point
out9 .
The disadvantage of error anticipation is the tremendous empirical work which is necessary for drawing up the peripheral grammar. The "rules" according to which the erroneous
phrases are constructed are often introduced by interference from the rules of the native
language of the translator. It is, of course, impossible to take into account all the languages
of the world and their impact on making mistakes in the target language. Therefore, an
anticipation-based syntax checker will never be exhaustive.
If we turn from the psychology of the translator to the psychology of the corrector, we
notice that the latter is able to correct a phrase even if he has not been confronted with the
same mistake before. He does not need to know why the translator made that mistake. The
only knowledge the corrector needs is a knowledge about the correct phrases of the target
language which are defined by the grammar G. As a consequence, it is reasonable to
model an automatic syntax checker similar to a native speaker of the target language who
is to proof-read and correct a translation.
We decided to adhere to the following guidelines for the implementation of our syntax
checker: In the same way as the parser, the algorithm for error detection and correction
must be uniform and language independent. It must not require any linguistic data in addition to the grammar that generates the correct syntactic constructions of the language in
question. Drawing up the lingware necessary to parse correct sentences is already difficult
and costly enough. Porting the system to a new language should not be burdened
with the task of anticipating the errors that can be made in that language. On the contrary,
the exchange of one grammar for another should at the same time enable the parser to
assign a structural representation to another input, as well as enabling the syntax checker
to detect and correct mistakes made in the new language.
The possibility of correcting a distorted text results from the contextual redundancy of natural languages. The words in a phrase give rise to the expectation of other words with certain properties and vice versa. The basic mechanism of error correction applied in proofreading seems to be the reconciliation of such expectations with the actual context. As
long as there are sufficiently precise expectations of what the complementary context
should look like, the actual data is likely to be adjustable even if it is ill-formed. This leads
to the conclusion that the key to error correction without peripheral grammar is the availability of extensive expectations created by the parser.
The Dependency Unification Grammar (DUG) used in the PLAIN system advocates a lexicalistic approach, i.e. the notion of syntactic structure is derived from the combination
capability of words rather than from the constituency of larger units. The combination
capability of words (i.e. their contextual expectation) is described by means of templates
8. An empirical study of approximately 1000 syntactic errors occurring in examination papers of
German as a foreign language has been conducted by our group.
9. See Weischedel/Sondheimer 1983. Anticipation of errors is also assumed by Guenthner/Sedogbo 1986, Mellish 1989 and Schwind 1988.
that assign slots to the word in question. The parser tries to fill the slots with appropriate material. When an error occurs, there will be a gap between the portions of the text analysed so far, because the latter do not meet each other's expectations. The syntax checker inspects the analysed portions around a gap for open slots that specify the correct context. At the same time, all forms of the inflectional paradigm of the adjacent portion are generated, and the one that meets the expectations stated in the corresponding slot is chosen.
We will concentrate in the sequel on the following important features of the PLAIN parser
and syntax checker:
a word-oriented approach to syntax as opposed to a sentence-oriented approach;
syntax description by equations and unification;
parsing based on the slot and filler principle;
parallelism as a guideline for the system's architecture;
error detection and correction without any additional resources.
(Figure material: example sentences such as "Arthur left", "Arthur attended the meeting", "John gave me a present", "Sheila persuaded him" and "I wonder whether he left", annotated with the distributional verb categories V1 to V10 and the complement categories NP1, NP2, NP3, PP_of, PP_with, PP_to, CL_that, CL_whether and INF.)
The constructs in Fig. 45 are remarkable from several points of view. First of all, the data in square brackets is a precise representation of the contextual expectations which are associated with the respective words. Making them available is an important step towards the strategy for error correction which we have sketched above. As opposed to simple distributional categories, like V1, V2 ... V9, the complex categorization in Fig. 45 is transparent.12 Each complex category denotes explicitly the syntactic properties of the words in question. If a categorization according to Fig. 45 is available, there is little information which the rule in Fig. 44 adds to it, except the fact that the presence of the elements mentioned in the subcategorization results in a sentence. If we neglect the position of V for the moment, we can replace the rule in Fig. 44 by the more abstract rule in Fig. 46.
(Fig. 45: lexical entries with subcategorizations in square brackets, e.g. leave [NP1 _], attend [NP1 _ NP2], help [CL_that _], amaze [CL_that _ NP2], agree [NP1 _ CL_that] and [NP1 _ INF], persuade [NP1 _ NP2 INF], wonder [NP1 _ CL_whether], together with the corresponding frames for give, remind and differ.)

S -> V[X] X          (Fig. 46)
verb   (+subject)
verb   (+subject, +direct_object)
verb   (+subject, +direct_object, +indirect_object)
verb   (+subject, +direct_object, +prep_object_of)
verb   (+subject, +prep_object_with, +prep_object_about)
verb   (+subject_clause)
verb   (+subject_clause, +direct_object)
verb   (+subject, +object_clause)
verb   (+subject, +direct_object, +object_clause)
verb   (+subject, +whether_clause)
verb   (+infinitive)
noun   (+determination)
noun   (+determination)
noun   (+determination)
prep   (+noun_phrase)
prep   (+noun_phrase)
prep   (+noun_phrase)
prep   (+noun_phrase)
conj   (+subclause)
conj   (+subclause)
conj   (+infinitive)
The similarity between the subcategorizations in Fig. 45 and the complement assignments in Fig. 48 is obvious. Both specify contextual expectations.
The next step is to describe the building of structure and the morpho-syntactic properties.
In most grammars, a set of rules serves this purpose. The DUG uses templates which mirror the dominating and the dependent nodes in the dependency tree directly. A template
consists of a dominating node which carries the template's name and, possibly, a morphosyntactic category which restricts the form of the governing word. The template's name
can be the individual lexeme of a word. In this case, the template will apply to that word
directly. The other templates apply to a word if they are assigned to it in the lexicon. Figure 48 is an example of such assignments.
A template consists, furthermore, of a so-called slot or, possibly, a disjunction of slots. A slot functions as a variable for a subtree which is to be subordinated to the governing node. The slot contains a precise description of (the dominating node of) the complement that is to fill the slot. Normally, a slot includes a role marker, a variable for the filler's (top-most) lexeme, and a more or less complex morpho-syntactic characterization. There might also be a selectional restriction associated with the lexeme variable. An important augmentation of DUG, as opposed to traditional dependency grammars, is the inclusion of positional categories in the node labels.
Dependency trees, including templates, are represented in the PLAIN system by bracketed
expressions. We turn to this format in the subsequent illustrations. Some (simplified) templates necessary for the construction of the example sentences are presented in Fig. 49.
(Fig. 49, in part: templates such as (* : +subject ...), (* : +prep_object_of ...), (* : +determination (: _ : noun adjacent[to_the_right])) and (* : +subclause ...).)
When a word in the input is processed, the templates are looked up which have been associated with the word in the lexicon. The slots in the templates are then ascribed to the word itself. For example, the word "amaze" is specified in Fig. 48 for +subject_clause and +direct_object. When the parser reaches the word in the input, the structure shown in Fig. 50 will be the result of consulting the lexicon. Figure 50 illustrates the way in which the DUG centers the grammatical information around the individual words, thus giving a detailed account of the contextual expectations which arise with each word. In a rule-based phrase
structure grammar, this information is not as overtly available. As we have argued above,
just these contextual clues are crucial for the corrector if something went wrong in a
phrase and has to be fixed.
(Fig. 50, in part: (* : amaze : verb (* : +subject_clause ...) (* : +direct_object ...)).)
(Fig. 51: the inflectional paradigm of the German reflexive verb "sich ärgern", glossed "I feel angry", "you feel angry", "he feels angry", "we feel angry", "you feel angry", "they feel angry".)
DUG represents morpho-syntactic properties by sets of typed features. Each feature, e.g. singular or plural, is classified according to its type, e.g. number. For instance, singular is represented as number[singular]. The expression in square brackets is called a "value". This representation is widely accepted now in modern grammar theories. As opposed to other grammars, DUG allows atomic values only (i.e. a value cannot be a typed feature again). However, a set of alternate values is admitted, e.g. person[1st,3rd].
The advantage of the explicit indication of types lies in the fact that agreement between
features can now be formulated in terms of the feature type. Instead of stating that singular
must go with singular and plural must go with plural, one can state that number must equal
number. (Of course, this is what traditional linguists have always done.) The categories in
syntactic descriptions function in the same way as constants and variables in an equation.
The instantiations of variables must be in agreement with each other. This principle has
already been applied in the abstract rule in Fig. 46. The calculation of consistent variable
instantiations in a syntactic formula is called "unification". The term has been taken over
from logic theorem proving. Grammars which employ similar techniques are known as
"unification grammars".
DUG is based on the unification of partial dependency trees, especially on the unification
of slots and fillers. This is the only device for building and recognizing phrases. There are
no production rules which have to be applied in order to create structures. The unification
of features is not a blind mechanism in DUG, but can be tuned to the specific situation.
The requirement of agreement between features of a certain type is explicitly stated by
means of the special values C (bottom-up agreement) and U (top-down agreement).
The agreement phenomena in Fig. 51 are accounted for by the following assignment of templates to the verb "ärgern":
ärgern
(* : +subject ...)
(* : +REFL ...)
Syntactic features are specified in slots; they are compared with those of a potential filler, and they are propagated across the nodes of the dependency tree if the value "C" is specified. If a feature type is marked for agreement in two categories, then the intersection of the value sets of both is formed. If this process results in an empty set for any type, then the unification has failed and the filler must not enter the slot.
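The effect of this intersection-based unification can be sketched in a few lines of Python (an illustration only; the feature values are taken from Fig. 53 and Fig. 54, everything else, including the dictionary representation, is an assumption):

def unify(a, b):
    # Intersect the value sets of every feature type present in both
    # descriptions; fail if any intersection becomes empty.
    result = {t: set(v) for t, v in a.items()}
    for ftype, values in b.items():
        if ftype in result:
            result[ftype] &= set(values)
            if not result[ftype]:
                return None        # unification fails: the filler must not enter the slot
        else:
            result[ftype] = set(values)
    return result

# The subject slot of the word form "ärgern" inherits person[1st,3rd] and
# number[plural] from the verb description in Fig. 53.
subject_slot = {"case": {"nomin"}, "person": {"1st", "3rd"}, "number": {"plural"}}
wir = {"case": {"nomin"}, "person": {"1st"}, "number": {"plural"}}
ich = {"case": {"nomin"}, "person": {"1st"}, "number": {"singular"}}

print(unify(subject_slot, wir))   # succeeds: "wir ärgern ..."
print(unify(subject_slot, ich))   # None: number fails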
Let us assume that the inflected word form "ärgern" (not to be confused with the lexeme "ärgern") is encountered in the input. The corresponding entry in the morphological lexicon and the templates in Fig. 49 would yield the description of this word and its contextual expectations shown in Fig. 53:
(* : ärgern : verb person[1st,3rd] number[plural] ...)
The personal pronouns which are potential fillers of the subject slot and the reflexive pronouns which are potential fillers of the REFL-slot must be specified in the morphological
lexicon in a way similar to Fig. 54.
(* : ich : noun case[nomin] person[1st] number[singular])
(* : wir : noun case[nomin] person[1st] number[plural])
(* : sie : noun case[nomin,acc] person[3rd] number[singular,plural])
(* : uns : reflpron case[acc] person[1st] number[plural])
(* : sich : reflpron case[acc] person[3rd] number[singular,plural])
The following phrases built with "ärgern" and the material in Fig. 54 would result from a successful unification of pronouns and slots:
139
(number fails)
(number and person fail)
(person fails)
(person fails)
140
If the contextual expectations of each word or segment are locally available, then each pair of a word and its potential complements can be processed at the same time without interference.
1 That 1
(* : that)
+subclause

2 Arthur 2
(* : Arthur)

3 attends 3
(* : attend)
+subject, +direct_object

2 Arthur attends 3
(* : attend (SUBJECT : Arthur) ...)
+direct_object

4 the 4
(* : the)

5 meeting 5
(* : meeting)
+determination

4 the meeting 5
(* : meeting (REFERENCE : the) ...)

6 amazes 6
(* : amaze)
+subject_clause, +direct_object

7 me 7
(* : me)

1 That Arthur attends the meeting amazes me 7
(* : amaze (SUBJECT : that (PREDICATE : attend (SUBJECT : Arthur) (DIR_OBJECT : meeting ...))) ...)

8 . 8
(ILLOCUTION : assertion)
The basic object of the PLAIN parser is the "bulletin". A bulletin provides all the information about a segment of the input: its coverage, its morpho-syntactic features, its dependency representations, and its associated templates which indicate the combinatorial potential. Bulletins can be accessed via a "bulletin board". The bulletin board reports which bulletins already exist with respect to the input segments.
The overall control of the parser is achieved by means of an agenda. The agenda is a list of
processes and arguments. As soon as there is free computing power, a process is taken
from the agenda and is executed. Each execution consists of a sequence of actions, possibly producing new data (e.g. new bulletins). All processes on the agenda can be executed
in parallel. The main processes are:
the scanner, which reads the next word from the input and extracts the morpho-syntactic information for this word from the lexicon;
the predictor, which collects all of the combinatorial information for the word from the
lexicon;
the selector, which chooses the pairs of bulletins that should be compared for compatibility;
the completor, which tries to combine segments on the basis of the information in their
bulletins in order to construct a coherent dependency tree;
the assess-result process, which checks the presence of a successful parse.
Except for the initial process (i.e. the scanner for the first word), processes and their arguments are put on the agenda by other processes. The scanner puts a scanner process for the
next word on the agenda. Furthermore, it puts one predictor process for each reading of
the actual word on the agenda. If the scanner encounters a sentence boundary, then it puts
the assess-result process on the agenda.
The predictor constructs a bulletin for the reading in question and supplies, most importantly, the templates which specify the combination capabilities of the word. There might
be alternate combination frames. For each frame, the predictor puts an instance of the
selector process on the agenda.
The selector looks up all bulletins which are to be compared with the actual one. The
selector consults the bulletin board for this purpose. The selector puts a completor process
on the agenda for each pair of bulletins it has chosen. If the heap of appropriate bulletins is
exhausted, the selector process might have to wait until more bulletins have been produced by other processes.
The completor tries to fit one of the two structures described in the bulletins into a slot of the other one, and vice versa. In each of the successful cases (there might be more than one possibility), the completor draws up a new bulletin. This bulletin provides the data about the resulting structure and is put on the agenda as the argument of a selector process. This technique causes a recursive application of the completor to larger and larger units. The parser will stop when a sentence boundary has been encountered and the agenda is empty.
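The control regime can be sketched as follows (a strongly simplified, sequential illustration of my own; the bulletin contents, the toy lexicon and the compatibility test are assumptions, and real bulletins carry far more information):

from collections import deque

agenda = deque()
bulletin_board = []

def scanner(words, i, lexicon):
    if i == len(words):
        agenda.append(("assess", None))          # sentence boundary reached
        return
    for reading in lexicon[words[i]]:
        agenda.append(("predict", (words[i], reading, i)))
    agenda.append(("scan", i + 1))

def predictor(word, reading, pos):
    bulletin = {"span": (pos, pos), "head": word, "slots": list(reading["slots"])}
    bulletin_board.append(bulletin)
    agenda.append(("select", bulletin))

def selector(bulletin):
    for other in bulletin_board:
        if other is not bulletin:
            agenda.append(("complete", (bulletin, other)))

def completor(a, b):
    # Try to fit one bulletin into an open slot of the other, and vice versa.
    for host, filler in ((a, b), (b, a)):
        if filler["head"] in host["slots"]:      # toy compatibility test
            new = {"span": (min(host["span"][0], filler["span"][0]),
                            max(host["span"][1], filler["span"][1])),
                   "head": host["head"],
                   "slots": [s for s in host["slots"] if s != filler["head"]]}
            bulletin_board.append(new)
            agenda.append(("select", new))

def run(words, lexicon):
    agenda.append(("scan", 0))
    while agenda:
        task, arg = agenda.popleft()
        if task == "scan":
            scanner(words, arg, lexicon)
        elif task == "predict":
            predictor(*arg)
        elif task == "select":
            selector(arg)
        elif task == "complete":
            completor(*arg)
        elif task == "assess":
            if agenda:                            # wait until all other work is done
                agenda.append(("assess", None))
            else:
                parses = [b for b in bulletin_board
                          if b["span"] == (0, len(words) - 1) and not b["slots"]]
                print("complete parses:", len(parses))

lexicon = {"Arthur": [{"slots": []}],
           "left":   [{"slots": ["Arthur"]}]}     # "left" expects its subject
run(["Arthur", "left"], lexicon)                  # -> complete parses: 1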
The assess-result process proposes the successful parses as candidates for the correction.
If no such result has been achieved, then the revisor process is put on the agenda again in
We conclude with an example. Let us assume that the input is "Sheila agree Arthur leave".
1 Sheila 1
(* : Sheila : noun person[3rd] number[singular] ...)

2 agree 2
(* : agree : verb person[1st,2nd] number[singular] ...)

2 agree 2
(* : agree : verb person[1st,2nd,3rd] number[plural] ...)

3 Arthur 3
(* : Arthur : noun person[3rd] number[singular] ...)

4 leave 4
(* : leave : verb person[1st,2nd] number[singular] ...)

4 leave 4
(* : leave : verb person[1st,2nd,3rd] number[plural] ...)

4 leaves 4
(* : leave : verb person[3rd] number[singular] ...)

1 Sheila agrees 2
(* : agree : verb person[3rd] number[singular] ...)

3 Arthur leaves 4
(* : leave : verb person[3rd] number[singular] ...)
There is still no complete parse. Therefore, the revisor is activated again. There is a new
path consisting of two segments (as displayed in Fig. 58). However, the generator cannot
create any new word forms. In this situation the slots of the segments are inspected. The
generator encounters the selectional lexeme "that" and creates the corresponding word.
The parser now tries again and comes up with the result shown in Fig. 59.
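The correction devices used in this example can be sketched as follows (toy paradigm data and slot representation of my own; the real generator works on the full morphological lexicon):

PARADIGM = {
    "agree": {("3rd", "singular"): "agrees", ("1st", "plural"): "agree"},
    "leave": {("3rd", "singular"): "leaves", ("1st", "plural"): "leave"},
}

def correct_form(subject_features, verb_lexeme):
    # Generate the paradigm of the verb and keep the form that satisfies
    # the expectations of the adjacent subject.
    for features, form in PARADIGM[verb_lexeme].items():
        if features == subject_features:
            return form
    return None

def fill_gap(open_slot):
    # If no inflectional variant helps, insert the selectional lexeme
    # named in the open slot (here the conjunction "that").
    return open_slot.get("lexeme")

sheila = ("3rd", "singular")
print(correct_form(sheila, "agree"))                           # agrees
print(fill_gap({"role": "SUBJECT_CLAUSE", "lexeme": "that"}))  # that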
18. The Greek Language Tools
18.1 Introduction
The modern Greek language is the result of a continuous evolutionary process that spans more than 3000 years. This long evolutionary process has resulted in a language that is both rich and loosely defined.
The Greek language is characterized by the unavailability of a concrete and unambiguous definition, and by its complicated grammatical and orthographic rules. This complexity has hindered the development of electronic processing tools for the language. Furthermore, the relatively small market size has made investment in language tools a less than obvious proposition.
The TWB provided the opportunity by which a number of internal developments by L-CUBE were assembled into a coherent set of language tools. A lexicon of 40000 lemmas has been developed, and a spelling checking / thesaurus / hyphenation engine has been built and ported to UNIX, Apple, and DOS environments. Finally, tight bindings have been implemented with Quark XPress in the Apple environment and with FrameMaker under UNIX.
18.2 Background
The Greek language has to be understood as a continuum of types and rules whose origin
lies in ancient Greek, coexisting with more recent ones, and where the proper spelling of a
word or its types is quite often up to interpretation. Looking back only two hundred years,
there are at least two major definitions of the "Greek language" that have been proposed
and widely used. The first one contains vocabulary and rules closer to the ancient Greek
(which in itself is a uniquely defined entity), while the second is closer to the everyday
spoken Greek. A variant of the latter has been adopted by the Greek Government as the
official form of the language since the early 1980s. This form of the language has been
chosen to be supported by the Greek Language Tools. The decision has been based on consultations with potential users of the tools, and also on the fact that the inclusion of words or types belonging to other forms of the language was problematic, at least for PCs, since the available character sets have no provision for the plethora of accents that would be required.
Another issue for the development of language tools for the Greek language is the character encoding for Greek characters. There exists an international standard (which is also a standard of the Hellenic Organization for Standardization, ELOT) which specifies the representation of Greek characters. This standard is ISO 8859-7 (also known as standard ELOT-928), and it specifies that each Greek character should be encoded in an 8-bit
byte. The standard is actually an extension of the usual ASCII representation, with the
Greek characters occupying the positions 128 to 254 (decimal).
Unfortunately, in the DOS world an encoding different from the one specified by the standard is used almost exclusively, and in the Apple Macintosh world yet another one is used. Although there are tendencies towards convergence to the standard, it is clear that a product currently being developed and aiming at supporting all major platforms has to address the peculiarities of each and every one of them. In the X Window environment on UNIX platforms, the use of Greek characters is not yet widespread, and in the course of the development special Greek fonts were produced, with the encoding specified in the ISO 8859-7 standard.
In the Greek Language Tools, an internal encoding scheme that was capable of representing all characters of the Greek language was defined. This way all internal processing and
data are platform-independent, and translation to the character encoding of the host
machine is implemented only for human interface reasons.
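Using today's terms, the translation step can be pictured as follows (a sketch that uses Unicode strings as a stand-in for the project's internal encoding; the codec names are the standard Python identifiers for the three host encodings, and code page 737 is assumed as the DOS encoding):

HOST_ENCODINGS = {
    "unix":  "iso8859_7",   # ISO 8859-7 / ELOT-928
    "dos":   "cp737",       # a common DOS code page for Greek (assumption)
    "apple": "mac_greek",   # the Apple Macintosh Greek encoding
}

def to_host(text, platform):
    # Translate internally represented Greek text to the host encoding.
    return text.encode(HOST_ENCODINGS[platform])

def from_host(raw, platform):
    return raw.decode(HOST_ENCODINGS[platform])

word = "και"                     # internal (here: Unicode) form of a Greek word
for platform in HOST_ENCODINGS:
    print(platform, to_host(word, platform))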
If a word is seen as a point in a multidimensional space of letter sequences, then the quest for a correct word that is similar to an erroneous one can be divided into three phases:
1. a neighbourhood of the input word is defined, so as to avoid searching the whole pattern space and instead limit the search to where it is most probable to succeed,
2. this neighbourhood is traversed in a way that the most promising parts are checked first,
and
3. words found are compared with the given one and certain similarity criteria are
checked.
Knowledge of the way spelling errors are introduced into a document helps to define this neighbourhood and the traversal criteria. Errors fall basically into three categories: author ignorance of the correct spelling, typographical errors when the text is typed, and errors introduced during the acquisition of data by the computer system. All of these errors can be modelled by four basic operations, namely insertion, deletion, substitution, and transposition. These operations are useful in order to establish a way to measure the difference between two words.
Depending on the operational environment (OS, memory size, etc.), a number of design
choices can be made for the efficient implementation of a spelling checker. These design
choices for the overall implementation are discussed in the sequel.
The tool set covers:
Spelling correction
Thesaurus
Hyphenation
Lexica Tools
The architecture of these tools and the various modules that comprise them are described
below.
18.3.1 The Detector Engine
The Greek Spelling Checker comprises two functionally independent modules, the detector module and the corrector module. They both operate on a spelling dictionary of Greek words which contains approximately 40000 lemmas.
The detector is the module that accepts as input a word, and tries to match it with the dictionary entries. If the word is found in the dictionary it is considered correct, otherwise it
is considered erroneous.
In a simplified description, the detector tries to decompose the input word morphologically. The ending of the word is checked against the set of valid suffixes, and for each
matching suffix, the detector searches the dictionary for the corresponding stem. When a
stem is found, the validity of the stem-suffix combination must be verified with respect to
suffix set, accent position etc.
Since the detector is the module that necessarily handles the whole input document (and not only the misspelled words, as the corrector does), its efficiency is critical for the whole system. The following two-level search strategy was implemented, which greatly speeds up the detector's operation: when a word is given, it is first checked against a table of the most frequently used words, and only if this look-up fails is the word morphologically decomposed and searched for in the main dictionary. Since a few hundred words account for a large percentage of the words used in any document, there is a high probability that this search will succeed, thus avoiding a series of accesses to the main dictionary.
The organization of the table of most commonly used words is based on the trie data structure. The trie has the advantage that it compares a character at a time instead of whole
strings, and the number of comparisons needed equals the string length. This structure is
also used to organize the suffixes thus speeding up the morphological decomposition
phase described above.
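The following sketch illustrates the two-level look-up (toy data in Latin transliteration rather than Greek; the real detector also verifies the stem-suffix combination against the suffix set and the accent position, which is omitted here):

class Trie:
    def __init__(self):
        self.children, self.is_word = {}, False
    def add(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True
    def __contains__(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

FREQUENT = Trie()
for w in ("kai", "na", "to", "se"):          # a few very frequent words (transliterated)
    FREQUENT.add(w)

SUFFIXES = ("os", "ou", "o", "oi", "a")      # a few noun endings (transliterated)
STEMS = {"anthrop", "logarithm"}             # toy main-dictionary stems

def detect(word):
    if word in FREQUENT:                      # level 1: frequent-word trie
        return True
    for suffix in SUFFIXES:                   # level 2: morphological decomposition
        if word.endswith(suffix) and word[:-len(suffix)] in STEMS:
            return True                       # stem found for a matching suffix
    return False

print(detect("kai"), detect("anthropos"), detect("anthropx"))   # True True False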
The corrector compares with the erroneous word only those dictionary entries that fall inside this word's neighbourhood. In order to define such a neighbourhood, the erroneous word goes through a series of transformations producing a list of strings. These strings define a neighbourhood in the dictionary. For each entry in this area its distance from the erroneous word is calculated. If it is below a suitable threshold, the entry is appended to a list of replacements for the erroneous word, which will be proposed to the user.
The distance between words is a rather fuzzy notion, which among other things strongly depends on the individual user's sense of similarity. This similarity can be described in terms of the Damerau-Levenshtein distance function. This approach takes as the distance between two words the sum of the costs of the primitive editing operations (insertion, deletion, substitution, transposition) that were defined above.
The Damerau-Levenshtein criterion does not compare other aspects of candidate words, such as phonetic resemblance, accented syllables etc., that seem to contribute to the feeling of similarity. The approach chosen is based on Damerau-Levenshtein with properly adjusted costs for the editing operations. In addition, the words are compared in a form in which there is no distinction between similarly uttered vowels.
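A sketch of such a distance function with adjustable costs is given below (an illustration, not the L-CUBE implementation; the vowel table is a stand-in for the Greek vowel classes mentioned above and uses Latin transliterations):

def normalise(word, table={"h": "i", "y": "i", "ei": "i", "oi": "i"}):
    # Illustrative only: map similarly uttered vowels to one symbol
    # (in the real tool the mapping is defined over Greek vowels).
    for src, dst in table.items():
        word = word.replace(src, dst)
    return word

def distance(a, b, ins=1.0, dele=1.0, sub=1.0, trans=0.5):
    a, b = normalise(a), normalise(b)
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * dele
    for j in range(1, len(b) + 1):
        d[0][j] = j * ins
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,       # deletion
                          d[i][j - 1] + ins,        # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans)   # transposition
    return d[len(a)][len(b)]

print(distance("paradeigma", "pardaeigma"))   # one transposition -> 0.5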
The distance threshold against which candidate words are compared also requires subtle
treatment. If it is too high, the user will be flooded with suggestions, while if too low, there
will be many situations where the corrector will fail to suggest anything at all, simply
because the correct suggestion is outside the threshold. The threshold is therefore dynamically adjusted to the results of the dictionary search.
1. Since the definition domain of a hash code is much narrower than that of a regular word, the number of resulting codes is significantly smaller than that of the words produced under the same transformations, resulting in a significantly lower number of disk requests.
2. Since the disk image of the dictionary is organized by the hash codes of its entries, hash codes that neighbour according to the transformations of the correction phase also neighbour physically, on disk. Page swapping is therefore reduced, with obvious gains in the correction speed.
The two modules also share a memory management mechanism. The need for such a mechanism becomes apparent if one considers the number of operations involved in checking a word for correct spelling and trying to suggest a similar one. The general purpose memory allocation routines introduce several problems: calls to allocate and release memory via the operating system result in a considerable overhead that cannot be ignored. Furthermore, since general purpose routines ignore the nature of the data involved, they cannot apply any specific optimization algorithms in the way data are allocated or released, and thus cannot avoid the memory fragmentation that further undermines the optimal operation of the system.
The memory management system implemented has two main components:
1. A dictionary memory buffer, where the dictionary entries are loaded. It offers allocation
and release of blocks of arbitrary size and reorganization of the allocated blocks when
fragmentation (inevitably) becomes disturbing.
2. An object cache. Various data objects are frequently allocated and released. The cache
can hold released objects, up to a limit, and hence allocation requests are satisfied without using conventional allocation routines, unless the cache is empty.
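The object cache can be pictured as a simple per-type free list (a sketch only; the actual implementation is in C and also covers the dictionary buffer reorganization, which is not shown):

class ObjectCache:
    def __init__(self, factory, limit=1024):
        self.factory, self.limit, self.free = factory, limit, []
    def allocate(self):
        # Serve a previously released object if one is available.
        return self.free.pop() if self.free else self.factory()
    def release(self, obj):
        # Keep released objects up to the limit; beyond it they are dropped.
        # (A real cache would also reset the object's fields here.)
        if len(self.free) < self.limit:
            self.free.append(obj)

node_cache = ObjectCache(dict)
n1 = node_cache.allocate()      # freshly created
node_cache.release(n1)
n2 = node_cache.allocate()      # served from the cache, no new allocation
print(n1 is n2)                 # True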
3. Third, a lexicon records some aspect of the language or languages it covers. Since the language evolves, a lexicon should be capable of following this evolution. Thus the development of a lexicon is a continuous process.
In order to construct the lexicon of the Greek Spelling Checker, the following steps were
undertaken:
1. Selection of lemmas. In order to rapidly develop a prototype of the lexicon and clarify
some of the issues involved, a corpus of 4000 lemmas was selected by experienced linguists. The criterion for selection was their frequency of usage, which due to lack of
statistical data, had to be judged by the linguists involved. After this step, a commercial
dictionary was selected and approximately 40000 lemmas were entered into the lexicon.
2. Grammatical information entry. For each lemma entered in the lexicon, a linguist provided the grammatical information related to it. This information includes inflectional models, accentuation shifts, and the formation and recording of derived types that are often omitted as obvious from dictionaries intended for human readers. The interface through which this information is provided must account for user-friendliness and consistency. Special software, with the appropriate interface, has been developed for this phase.
3. Lexicon validation. The lemmas entered into the lexicon have been exported along with
the complete information related to them. This information was extensively checked.
This process, necessarily carried out by the linguists involved, posed considerable
logistic problems.
4. Spelling Checker and Thesaurus lexicon. A tool for document processing has specific requirements in performance, resources, etc., that make the general purpose grammatical lexicon unsuitable for it. Instead, data structures more appropriate to the specific exploitation of the lexicon must be designed, and the lexicon data must be projected onto these formats. A suite of transformation tools was developed for that purpose. In addition, a database meant to be the source of the various lexica has been developed.
The effort required to acquire a representative text corpus was originally underestimated. The data came in all kinds of file formats, character encodings and media, and conversion and translation effort, most of it necessarily with human attendance, was almost always required. In acquiring the necessary texts, issues of copyright and confidentiality proved to be a serious problem. Acquiring business-related documents in particular was an extremely lengthy process.
The acquired texts, after passing through the filtering and conversion process in order to be converted to a common format, were fed to a modified UNIX version of the spelling detector, which communicated with the lexicon module. The lemmas or types found in the lexicon were passed to the statistics gathering modules.
The parameters of major relevance to the spelling checker were the frequencies of lemmata and words, the frequencies of individual letters, digrams and trigrams, and the frequencies of suffixes and suffix sets. The main results are summarized in the following:
The overall rejection ratio (the percentage of words not found in the lexicon) was around 9%. After subtracting geographical terms and various linguistic extremes, the rejection ratio was about 3%, and more than half of it represented words that had to be included in the lexicon. About thirty percent represented diminutive forms of nouns or adjectives. The most common of these terms have been incorporated in the lexicon, but a linguistic analysis of the phenomenon is in progress to evaluate the feasibility of addressing the problem on a semi-algorithmic basis without an undue increase of the lexicon size.
Concerning the common words dictionary, it was found that 40 lemmas have a cumulative
probability of more than 28%. The table of common words used by the spelling checker
has been modified on the basis of the findings.
18.6 Exploitation
The Greek Spelling Checker operates on all major computing platforms, namely Apple Macintoshes, DOS machines (IBM PCs and compatibles) and UNIX workstations. These bindings and the exploitation plans are summarized in the sections that follow.
18.6.1 Macintosh
The Macintosh is perhaps the most popular platform in electronic publishing. Among the most important software packages in this area is Quark XPress, a page layout program that is extensively used for the professional production of a variety of publications, ranging from small leaflets to daily newspapers. An important feature of Quark XPress is its open architecture. Based on the concept of code resources that are native to MacOS, XPress allows integration with modules that communicate through a high-level functional interface. These modules are called Quark XTensions.
The Greek Spelling Checker was integrated with Quark XPress through this mechanism. It appears to the user as a menu item, alongside the menu items of the corresponding tools for English. When activated, it checks the document for errors, notifying the user of suspect words, and suggests candidates for correction when prompted by the user. It must be noted that all these actions take place in seamless cooperation with the host application. The interface to the user is consistent with the Apple Human Interface Guidelines, and the suggestions of the spelling checker always retain the attributes of the original text.
18.6.2 DOS
In the DOS arena, the diversity of document processing tools and even differences in platform make it difficult to integrate with any particular product. For example, although Microsoft Windows is a successful environment, it is only partially established in Greece, mainly due to its considerable resource requirements. On the other hand, existing word processing or page layout products are rather "closed" as far as integration with other software is concerned.
These are the reasons why the Greek Spelling Checker currently exists for DOS only as a stand-alone program supporting text files. However, the company is working on a product aimed at the Windows platform.
18.6.3 UNIX
The document processing tools that were previously described benefit greatly from their
port to a UNIX platform.
The major improvement comes from the different architecture used for the tools: since
UNIX is a multiprocessing operating system, there was no need for the totality of the document processing tools to be a single program (process). Therefore, they were transformed
into a client-server architecture, where the servers run continuously in the background and
service requests from the clients. This client-server architecture is implemented using the
Remote Procedure Call (RPC) protocol, resulting in a completely distributed system.
There are distinct servers for each of the services the tools provide, i.e., detection, correction, hyphenation and thesaurus. However, it is perfectly legal to have a multitude of servers of the same kind running, each one servicing a specific kind of request. For example,
one can envision an environment where there is a dedicated spelling correction server for
words relevant to medical sciences, one for words used in legal documents, etc. These
servers are functionally equivalent; they just use specialized dictionaries. With this architecture, when a client issues a request, any server that knows how to handle it answers,
thus achieving minimum delay, since a client never waits for a busy server to be free,
while there is another server sitting idle.
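The division of labour can be sketched as follows (a simplified stand-in that uses a plain TCP exchange instead of the RPC protocol of the actual system; the dictionary contents and the one-line request protocol are assumptions):

import socket
import socketserver
import threading

DICTIONARY = {"και", "νερό", "λέξη"}          # toy speller dictionary

class SpellHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One request per connection: a word, answered with OK or UNKNOWN.
        word = self.rfile.readline().decode("utf-8").strip()
        answer = "OK" if word in DICTIONARY else "UNKNOWN"
        self.wfile.write((answer + "\n").encode("utf-8"))

def ask(host, port, word):
    with socket.create_connection((host, port)) as s:
        s.sendall((word + "\n").encode("utf-8"))
        return s.recv(1024).decode("utf-8").strip()

server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), SpellHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

print(ask("127.0.0.1", port, "λέξη"))      # OK
print(ask("127.0.0.1", port, "λεξη"))      # UNKNOWN (missing accent)
server.shutdown()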
Access to these servers is provided both by character-based clients, much in the manner of the original UNIX spell program, and by a graphical interface based on the X Window System.
The X interface to the tools was tailored to communicate with FrameMaker, by Frame Technology Corporation, which is one of the best word processing and page layout packages available for UNIX workstations. The user uses FrameMaker to write his document, and the Greek Spelling Checker tools to work on it. The words produced by the Greek Language Tools can be automatically incorporated in the FrameMaker document.
The intercommunication between FrameMaker and the Greek Spelling Checker X11 front end is automatic and based on well-defined interface mechanisms, namely the X Selection Buffer and RPC commands, which are supported across different versions of both the X11 system and FrameMaker.
Since there is no widely available font for displaying Greek text in the X Window System
environment, Greek fonts according to the ISO 8859-7 encoding were developed. Furthermore, in order to be able to input Greek characters, a toggling mechanism between English and Greek keyboard states was implemented, based on the translation mechanism of
the X Toolkit, which is an integral part of the X Window System. The complete implementation of the Greek keyboard driver is accomplished solely through resources, without
any modification of the X Window System code.
18.6.4 Summary
The development work had commercial exploitation built in from the start. The code for the various modules has been designed to be extremely portable, physical memory requirements are small for all implementations, and the storage requirements for the lexicon are very modest (just over one megabyte). The lexica work also went to great lengths to ensure that the linguistic decisions made would be acceptable to the largest possible customer base. The seamless binding to popular word processors and page layout products was also a target from the beginning of the development.
Two such bindings exist currently, with XPress for the Macintosh world, and with FrameMaker for the UNIX platforms. The first one is basically a production grade implementation, while for the second one some product development work is still carried out. These
two implementations will be commercially available in June 1992 and September 1992
respectively, as shrink-wrapped packages.
The engines of the various tools will also be integrated in turnkey systems that the company is installing for the editing/publishing sector. Such an integration has been carried out for a daily newspaper editorial system, and another is currently in progress for a subtitling facility for a major Greek video studio.
Future plans include improving the coverage of the lexicon (70000 words), a WordPerfect binding, and the supply of the same facilities in an integrated English-Greek environment.
Two prototypes were developed in the course of the Translator's Workbench project: the fully integrated UNIX workbench1 and a more reduced DOS version running under MS Windows. For each of these versions some guidelines had to be followed by the partners willing to integrate their software. The interface and integration activities are described in the following chapter.
1. We would like to thank Thomas Glombik, SNI, for his work on the FrameMaker integration.
The WinWord macro language offers both the features necessary for designing individual user interfaces and special functions for communicating with other Windows applications. When WinWord is started, the TAE interface is built up by a special macro which comes up with a dialog box asking the users whether to use the TAE extensions or not. Additionally, users can choose the languages they want to work with and arrange windows as they please.
(Screen shot: the TAE dialog box offers the languages German, English, French and Spanish and the options Checking Run, Dictionary Lookup and Dictionary Extension.)
Fig. 61: After calling the TAE proofreading tool, the user is asked to check some parameters
Errors found during a checking run are flagged, and an explanation or rule is displayed together with a correction proposal (see Fig. 62 for an example).
The Dictionary Extension mode allows adding multi-words or words in context to the dictionary. As the user has to type in information concerning the word in focus, the left and
right context and what possible misspellings could look like, this option should be used by
experienced users only. But if users want to add isolated words like e.g. idioms, names,
and trade marks, they can do it while running the Proofreading tool in one of the checking
modes: Whenever the first step of the Proofreading program comes up with a name the
system does not know yet, the Add... button can be pressed and the word (or whatever word the user types in) can be added to the user dictionary. The message box displayed in Fig. 62 enables the user to accept the suggested correction, ignore it and continue checking, or cancel the checking run.
(Screen shot: a WinWord document in which the TAE proofreading message box flags a word and displays an explanation together with a correction proposal.)
Fig. 62: The proofreading application provides an explanation or rule for each error
WinWord does not offer a macro function that returns selected text plus format information. That means that it is impossible to conserve format information that has been inserted in the text before the checking run. At the moment we are trying to find a good solution to this problem.
19.1.6 No More Printed Dictionaries Are Necessary - Everything Is Online
As described above, DDE communication is a good way of connecting WinWord to other applications. Various external programs and functions can be built into WinWord this way. A very powerful connection is the one from the TAE editor to the HFR Dictionary application, because the user can look up words of special purpose languages during the editing or translating session. Translations found within the dictionary can be copied to and inserted into the text directly with the help of the Transfer button. As a detailed description has already been given in the section on Lexica, no more details on the HFR Dictionary application will be given here. To get an overview of its functionality, see Fig. 63.
[Fig. 63: Screenshot of the HFR Dictionary window. An index of entries (e.g. share warrant, share broker, shareholder) is shown on the left; the German equivalents (e.g. Wertpapiere, Staatspapiere) and further lexical information appear in the translation window on the right.]
On the left-hand side, there is an index bar where the user can browse through the dictionary. As soon as an entry is selected, its translation and some further lexical information are displayed in the translation window on the right-hand side. For quick look-up, there is also an edit line where the user can type in the word under consideration. Besides the Transfer button there are two other function buttons which allow the user to copy text to (Copy) and insert text from (Insert) the Clipboard. The menu bar of the HFR Dictionary follows the SAA standard.
[Screenshot: translation suggestions such as "and -> und", "specialize -> spezialisiert", and "in the manufacture of -> auf die Herstellung von".]
Fig. 64: The translation memory demonstrator comes up with translation suggestions
common messages: Such a message is known and can be interpreted by all modules and the TWB_MANAGER. They comprise messages like initializing or terminating a module and so on. Common messages should not be redefined by different modules because they are executed in a special way.
private messages: These messages and their interpretation are not known to every other module, but are restricted in most cases to the TWB_MANAGER and the appropriate module (e.g. messages concerning the data transfer between the spelling checkers and the TWB_MANAGER). This also includes messages which are sent from one module to another module (via the TWB_MANAGER).
Additionally, messages may be divided into task messages and answer messages. Task messages contain a task or information for other modules or the TWB_MANAGER. Answer messages represent the answer of one module to another module with regard to a task message.
2. Please see the annex to this section for a definition of the message formats.
A module sends a message to another module. The message will not be received DIRECTLY by the receiver module but will first be checked by the TWB_MANAGER. The TWB_MANAGER checks the content of the command, and if it finds a command which can be interpreted by the TWB_MANAGER, it will execute it regardless of the receiver. It follows that commands which can be interpreted by the TWB_MANAGER cannot be used by other modules! When the TWB_MANAGER finds a command it cannot interpret, it will pass the message to the receiver of the message. In this case it is the responsibility of the receiver to check if an allowed command has been specified by the sender. If not, it must send an error message to the sender. The following picture illustrates this message passing. Modules can only communicate with each other via the TWB_MANAGER, thus allowing the manager to switch off a module when timeout problems occur.
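A minimal sketch of this routing rule in C is given below. The message layout, the command codes, and the set of commands the manager executes itself are assumptions made for the illustration; the actual formats are defined in the annex to this section.

    /* Illustrative sketch only: the real message format and command sets are
     * defined in the annex to this section. */
    #include <stdio.h>
    #include <string.h>

    enum {                       /* assumed numeric codes for some commands        */
        TWB_ACK = 1, TWB_ACK_OK, TWB_INIT_MODULE,
        TWB_FINISH_MODULE, TWB_END_MODULE, TWB_MODULE_EXIT,
        TWB_CHECK_TEXT           /* example of a private (module-specific) command */
    };

    typedef struct {
        int  sender;             /* sending module                                 */
        int  receiver;           /* intended receiver, or a dummy "all" value      */
        int  command;            /* common or private command code                 */
        char data[128];          /* command-specific payload                       */
    } TwbMessage;

    /* Commands the manager interprets itself; an illustrative choice only. */
    static int manager_understands(int command)
    {
        return command == TWB_FINISH_MODULE || command == TWB_MODULE_EXIT;
    }

    /* Every message is checked by the manager first: commands it understands are
     * executed by the manager itself, regardless of the receiver field; all other
     * commands are forwarded, and the receiver has to reject disallowed ones. */
    static void twb_manager_dispatch(const TwbMessage *msg)
    {
        if (manager_understands(msg->command))
            printf("manager executes command %d itself\n", msg->command);
        else
            printf("manager forwards command %d from %d to %d\n",
                   msg->command, msg->sender, msg->receiver);
    }

    int main(void)
    {
        TwbMessage m = { 2, 3, TWB_CHECK_TEXT, "" };
        strcpy(m.data, "please check this sentence");
        twb_manager_dispatch(&m);
        return 0;
    }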
[Figure: message passing between modules via the TWB_MANAGER (Translator's Workbench, ESPRIT Project No. 2315).]
Protocol
Within the init phase the following procedure applies: First, the TWB_MANAGER starts (executes) the appropriate module. Then the TWB_MANAGER sends a test command (TWB_ACK) to the module using message passing. The module returns a message which indicates whether the initialization has been done correctly (TWB_ACK_OK). After receiving this message the TWB_MANAGER passes initial parameters to the module using the TWB_INIT_MODULE command. The init phase is supported by a special init call in C (see technical annex).
In the next phase (execution phase) a set of messages is passed between the TWB_MANAGER and the module doing some module-specific work.
Two possibilities for terminating a module exist. In the first case the module signals that its task has been finished by sending a TWB_FINISH_MODULE message to the TWB_MANAGER. In the other case the TWB_MANAGER signals the module to terminate. This is done by sending a TWB_END_MODULE message to the module. In both cases the module itself is not allowed to terminate without contacting the TWB_MANAGER. The module has to terminate when it receives a TWB_END_MODULE message. Before it terminates it has to send a TWB_MODULE_EXIT message to the TWB_MANAGER indicating that termination is correct.
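The three phases can be summarized from a module's point of view as in the following C sketch. The functions receive_from_manager() and send_to_manager() are hypothetical stand-ins for the message-passing primitives of the technical annex, and the numeric command codes are assumptions.

    /* Sketch of the module lifecycle: init phase, execution phase, termination.
     * The message-passing primitives below are stand-ins, not the real annex API. */
    #include <stdio.h>

    enum { TWB_ACK = 1, TWB_ACK_OK, TWB_INIT_MODULE, TWB_FINISH_MODULE,
           TWB_END_MODULE, TWB_MODULE_EXIT, TWB_TASK /* placeholder for private work */ };

    static void send_to_manager(int command)
    {
        printf("module -> manager: command %d\n", command);
    }

    static int receive_from_manager(void)
    {
        /* Scripted sequence standing in for real message passing. */
        static const int script[] = { TWB_ACK, TWB_INIT_MODULE, TWB_TASK, TWB_END_MODULE };
        static int i = 0;
        return script[i++];
    }

    static void module_main_loop(void)
    {
        /* Init phase: answer the manager's test command, then take the parameters. */
        if (receive_from_manager() == TWB_ACK)
            send_to_manager(TWB_ACK_OK);
        if (receive_from_manager() == TWB_INIT_MODULE) {
            /* read initial parameters here */
        }

        /* Execution phase: module-specific task/answer messages. */
        for (;;) {
            int cmd = receive_from_manager();
            if (cmd == TWB_END_MODULE)
                break;                              /* manager tells us to stop     */
            if (cmd == TWB_TASK) {
                /* ... do the work, send answer messages ... */
                send_to_manager(TWB_FINISH_MODULE); /* signal that our task is done */
            }
        }

        /* A module never terminates on its own: confirm the shutdown first. */
        send_to_manager(TWB_MODULE_EXIT);
    }

    int main(void)
    {
        module_main_loop();
        return 0;
    }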
Module Names
TWB_MANAGER              TWB manager
TWB_CHECKER              various spelling checkers
TWB_PRE_TRANSLATION      pre-translation module
TWB_TERM_DATA_BASE       term data base module
TWB_TRANSLATION          translation module
TWB_EDITOR               editor module
TWB_ALL                  dummy module, specified when any of the above modules should be used as <sender>
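In code these module names can be kept as symbolic constants. The enumeration below is only one possible representation assumed for this example, not the definition given in the annex.

    /* Illustrative only: symbolic constants for the module names above. */
    typedef enum {
        TWB_MANAGER_ID,          /* TWB manager                         */
        TWB_CHECKER_ID,          /* various spelling checkers           */
        TWB_PRE_TRANSLATION_ID,  /* pre-translation module              */
        TWB_TERM_DATA_BASE_ID,   /* term data base module               */
        TWB_TRANSLATION_ID,      /* translation module                  */
        TWB_EDITOR_ID,           /* editor module                       */
        TWB_ALL_ID               /* dummy value, used as <sender> wildcard */
    } TwbModule;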
20.
As pointed out in Chap. 4, user participation is the only means to guarantee that the software developed in the course of a software project reaches a certain quality standard. However, the common understanding is that "... a software product, as object of evaluation, does not lend itself (in the current state of the art of software engineering) to any empirical-analytical investigation like the usual products of handicraft and industry" (Christ et al. 1984). Thus the first step in developing a user-oriented evaluation approach for TWB was to define the term "quality" as precisely as possible.
[Figure: quality model used for the evaluation. Quality factors (correctness, reliability, task adequacy, efficiency, usability) are broken down into criteria at the functional, content, and interface levels, and further into measured quantities such as number of failures, mean time to failure, failure type, ratio of searched/found terms, actual/detected errors, searched/found information, correct/incorrect data output, suitability of presented information for a specific purpose, understandability of help facility, documentation and system messages, consistency, comprehensibility, task relevance, ease of use, ease of learning, clarity of layout, mnemonic labels, execution and performance efficiency, error tolerance (undo facility, escape function, error messages), WYSIWYG, and checklist items.]
[Table: Integrated TWB/UNIX - achievements and deficiencies of the tools Editor (FrameMaker), Toolbox, Profile, Termbank, and Translation Memory.]
2. Based on the UNIX version of MATE, the TELISMAN tool, an additional tool initiated by MB, is being implemented by the University of Surrey. It covers a terminology elaboration tool, a parallel corpus management tool, and a termbank, which can be accessed from WinWord and thus be used by the translator during the translation process.
[Table: Stand-Alone TWB/UNIX - achievements and deficiencies of the Translation Memory (training facility; improving quality after training), Cardbox, and MATE tools.]
aspects which were put forward by the testing team were taken up in the specification of TWB II, the follow-up project of TWB on the PC platform.
PC Version TA

Extended Speller
Achievements:
- integrated into WinWord
- integrates a lexicon of typical errors which cannot be handled by ordinary spelling checkers, such as irregular plurals, capitalization, punctuation etc.
- includes grammatical rules
Deficiencies:
- integration into the normal WinWord spell checker not yet complete
- not all correction rules implemented
- overgeneration of corrections for German compounds
- program not yet fully stable

Legal/finance/commerce dictionary
Achievements:
- integrated into WinWord
- retrieves selected words
- shows the alphabetic environment if a word is not in the lexicon
- allows alphabetic scrolling in

Fig. 69: Achievements and deficiencies of PC version TA
21. Products
References
Ahmad, K., Fulford, H., Holmes-Higgin, P., Langdon, A., Rogers, M., (1989): A 'knowledge-based' record format for terminological data, Guildford: University of Surrey.
Ahmad, K., Fulford, H., Holmes-Higgin, P., Rogers, M., Thomas, P., (1990): The Translators Workbench Project. In: C. Picken, ed., Translating and the Computer II, London: ASLIB.
Ahmad, K., Fulford, H., Griffin, S., Holmes-Higgin, P., (1991): Text-based knowledge acquisition: a language for special purposes perspective. In: I.M. Graham and R.W. Milne, eds., Research and Development in Expert Systems VIII, Cambridge: Cambridge University Press.
Ahmad, K., Fulford, H., Rogers, M., (forthcoming): The elaboration of special language terms: the role of contextual examples, representative samples and normative requirements. Proc. EURALEX, Tampere, Finland, August 4-9, 1992.
Ahmad, K., Holmes-Higgin, P., Langdon, A., (1990): A computer-based environment for eliciting,
representing and disseminating terminology. Report of ESPRIT Project 2315, Translators Workbench, Guildford: University of Surrey.
Ahmad, K., Rogers, M., (in preparation): Knowledge Processing - An AI Perspective. Guildford:
University of Surrey.
Albl, M., Kohn, K., Pooth, S., Zabel, R. (1990): Specification of terminological knowledge for translation purposes. Report of ESPRIT Project 2315, Translator's Workbench. University of Heidelberg.
Albl, M., Kohn, K., Mikasa, Patt, C., Zabel, R. (1991): Conceptual design of termbanks for translation purposes. Report of ESPRIT Project 2315, Translator's Workbench. University of Heidelberg.
Alcina Franch, Juan, Blecua, José Manuel (1991): Gramática Española. Barcelona: Ariel.
Alford, M., (1971): Computer Assistance in Learning to Read Foreign Languages: An Account of the Work of the Scientific Language Project, Cambridge: Literary and Linguistic Computing Centre.
Alonso, J. (1990): Transfer Interstructure: designing an 'interlingua' for transfer-based MT systems. Proc. 3rd Intern. Conf. on Theoretical and Methodological Issues in MT, Austin, TX.
Appelt, W. (1992): Document architecture in open systems: The ODA standard. Berlin: Springer-Verlag.
Atwell, E.S. (1987): How to detect grammatical errors in a text without parsing it. In: Proceedings of the 3rd Conference of the European Chapter of the Association for Computational Linguistics. Copenhagen, pp. 38ff.
Atwell, E.S. (1988): Grammatical analysis of English by statistical pattern recognition. In: Proceedings of the 4th International Conference on Pattern Recognition. Cambridge, UK, 1988, pp. 626ff.
Bahl, L.R., Jelinek, F., Mercer, R.L. (1983): A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI Vol. 5, No. 2, pp. 179-190.
Boehm, B.W. et al. (1978): Characteristics of Software Quality. TRW Series on Software Technology, Vol. 1, Amsterdam.
Boguraev, B., Briscoe, T., (eds.), (1989): Computational Lexicography for Natural Language Processing, Longman: London.
Borissova, E. (1988): Two-component teaching system that understands and corrects mistakes. In: Proc. 12th International Conference on Computational Linguistics (COLING 1988). Budapest 1988. Vol. I, pp. 68ff.
Brown, P., Cocke, J., della Pietra, S., della Pietra, V., Jelinek, F., Mercer, R., Roossin, P., (1988): A statistical approach to language translation. Proc. 12th International Conference on Computational Linguistics (COLING) 1988, Budapest, pp. 71-76.
Brown, P., et al. (1990): A Statistical Approach to Machine Translation. Computational Linguistics, Vol. 16, No. 2, pp. 79-85.
Bush, V., (1945): As we may think. Atlantic Monthly, 176, No. 1, 101-108, July 1945.
Caeyers, H., Adriaens, G. (1990): Efficient parsing using preferences. Proc. 3rd Intern. Conf. on Theoretical and Methodological Issues in MT, Austin, TX.
Calzolari, N. (1989): The development of large mono- and bilingual lexical data bases, contribution to the IBM Europe Institute 'Computer-based Translation of Natural Language', Garmisch-Partenkirchen.
Carbonell, J. (1986): Requirements for Robust Natural Language Interfaces: The Language Craft™ and XCALIBUR experiences, Proc. 11th International Conference on Computational Linguistics (COLING) 1986, Bonn.
CCITT (1988): CCITT Study Group VII, Message Handling System: X.400 Series of Recommendations.
CCITT (1988), Recommendation T.502: A Document Application Profile PM1 for the interchange of Processable Form Documents.
CEC (1991): ECHO User Manual.
CEN/CENELEC ENV 41 510 (1989): ODA Document Application Profile Q112 - Processable and Formatted Documents - Extended Mixed Mode.
CEN/CENELEC ENV 41 509 (1989): ODA Document Application Profile Q111 - Processable and Formatted Documents - Basic Character Content.
Chen, P. (1975): The Entity-Relationship Model: toward a unified view of data. ACM Transactions on Database Systems, Vol. 1, No. 1, 9-36.
Chomsky, N. (1956): Three models for the description of language. IRE Transactions on Information Theory, IT-2, pp. 113-124.
Hellwig, P. (1980): PLAIN - A program system for dependency analysis and for simulating natural language inference. In: L. Bolc (ed.): Representation and Processing of Natural Language. Munich: Hanser & Macmillan, pp. 271ff.
Hellwig, P. (1986): Dependency Unification Grammar. In: Proc. 11th International Conference on Computational Linguistics (COLING 1986). Bonn, pp. 195ff.
Hellwig, P., Bub, V., Romke, C. (1990): Theoretical basis for efficient grammar checkers. Part II: Automatic error detection and correction. Appendix: Empirical feasibility study. Report of ESPRIT Project 2315, Translator's Workbench. University of Heidelberg.
Hellwig, P. (1988): Chart parsing according to the slot and filler principle. In: Proceedings of the 12th International Conference on Computational Linguistics (COLING 1988). Budapest: Vol. I, pp. 242ff.
Hellwig, P. (1989): Parsing natürlicher Sprachen: Grundlagen - Realisierungen. In: I.S. Batori, W. Lenders, W. Putschke (eds.): Computational Linguistics. An International Handbook on Computer Oriented Language Research and Applications. Handbücher zur Sprach- und Kommunikationswissenschaft vol. 4, pp. 348ff. Berlin: De Gruyter 1989.
Heyer, G. (1990): Probleme und Aufgaben einer angewandten Computerlinguistik. In: KI 1/90.
Herbst, R., Readett, A.G. (1985): Dictionary of Commercial, Financial and Legal Terms: English - German - French, Volume I, Translegal, Thun, Switzerland, 1985.
Herbst, R., Readett, A.G. (1989): Wörterbuch der Handels-, Finanz- und Rechtssprache: Deutsch - Englisch - Französisch, Band II, Translegal, Thun, Switzerland, 1989.
Herbst, R., Readett, A.G. (1987): Dictionnaire des termes Commerciaux, Financiers et Juridiques, Français - Anglais - Allemand, Volume III, Translegal, Thun, Switzerland, 1987.
Hoge, M., Wiedenmann, O., Kroupa, E. (1991): Evaluation of the TWB. Report of ESPRIT Project 2315, Translator's Workbench. Mercedes Benz AG.
Hoge, M., Hohmann, A., Mayer, R., Smude, P. (1992): Evaluation of the Translator's Workbench - Operationalization and Test Results. Report of ESPRIT Project 2315, Translator's Workbench. Mercedes Benz AG, Stuttgart.
Holmes-Higgin, P., (1992), The machine assisted terminology elicitation environment, AI International Newsletter, Issue 3, pp. 9-11.
Holmes-Higgin, P., Griffin, S., (1991): MATE User Guide, Guildford: University of Surrey.
ISO 7498 (1984): Information processing systems. Open systems interconnection. Basic reference
model.
ISO 8613 (1989): Information Processing - Text and Office Systems - Office Document Architecture (ODA) and Interchange Format, Parts 1-8.
ISO 8824/8825 (1988): Abstract Syntax Notation One, ASN.1.
ISO 8879 (1986): Information Processing - Text and Office Systems - Standard Generalized
Markup Language, SGML.
ISO 10021 (1988): Message Oriented Text Interchange System, Parts 1-7.
ISO DISP 10610 (1991): FOD-11, Profile for the Interchange of Basic Function Character Content Documents.
ISO DISP 11181 (1991): FOD-26, Profile for the Interchange of Enhanced Function Mixed Content Documents.
ISO DISP 11182 (1991): FOD-36, Profile for the Interchange of Extended Function Mixed Content Documents.
Jensen, K., Heidorn, G.E., Miller, L.A., Ravin, Y. (1983): Parse fitting and prose fixing: getting a hold on ill-formedness. In: Computational Linguistics 9/3-4, pp. 147ff.
Keck, B. (1989): Theoretical study of a statistical approach to translation. Report of ESPRIT Project 2315, Translator's Workbench. Fraunhofer Society IAO Stuttgart.
Keck, B. (1991): Translation Memory: A translation aid system based on statistical methods. Report of ESPRIT Project 2315, Translator's Workbench. Fraunhofer Society IAO Stuttgart.
Kittredge, R., and Lehrberger, J. (eds), 1982: Sublanguage: Studies of Language in Restricted Semantic Domains. Berlin: De Gruyter.
Knops, U., Thurmair, G. (1993): Design of a multifunctional lexicon. In: H.B. Sonneveld, K.G. Loening, eds., Terminology: Applications in interdisciplinary communication. Amsterdam: J. Benjamins B.V.
Knuth, D.E. (1973): The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley.
Kohn, K. (1990a): Translation as conflict. In P.H. Nelde (ed.): Confli(c)t. Proceedings of the International Symposium 'Contact+Confli(c)t', Brussels, 2-4 June 1988. Association Belge de Linguistique Appliquée: ABLA Papers 14, 105-113.
Kohn, K. (1990b): Terminological knowledge for translation purposes. In R. Arntz, G. Thome (Hrsg.): Übersetzungswissenschaft. Ergebnisse und Perspektiven. Festschrift für Wolfram Wilss zum 65. Geburtstag. Tübingen: Narr, 199-206.
Krause, J. (1988): Sprachanalytische Komponenten in der Textverarbeitung. Silbentrennung, Rechtschreibhilfen und Stilprüfer vor dem Hintergrund einer kognitiven Modellierung des Schreibprozesses. In: J. Röhrich (ed.): Handbuch der Textverarbeitung.
Kudo, I., Koshino, H., Chung, M., Morimoto, T. (1988): Schema method: a framework for correcting grammatically ill-formed input. In: Proc. 12th International Conference on Computational Linguistics (COLING 1988). Budapest, pp. 341ff.
Longhurst, J., Mayer, R., Smude, P. (1991): Cardbox. Report of ESPRIT Project 2315, Translator's Workbench. Fraunhofer Society IAO Stuttgart.
Longman Dictionary of Contemporary English (1987): Essex: Longman.
Lothholz, Kl. (1986): Einige Überlegungen zur übersetzungsbezogenen Terminologiearbeit, TEXTconTEXT 3, 193-210.
Marchionini, G., Shneiderman, B. (1988): Finding Facts vs. Browsing Knowledge in Hypertext Systems. IEEE Computer 21(1), 70-80.
Mayer, R. (1990): Benutzerschnittstelle für eine Terminologiedatenbank. In: Endres-Niggemeyer, B. et al. (Hrsg.): Interaktion und Kommunikation mit dem Computer. Proc. GLDV-Jahrestagung. Informatik Fachberichte 238. Berlin: Springer-Verlag.
Mayer, R. (1990): Investigation of Retrieval Strategies. Report of ESPRIT Project 2315, Translator's Workbench. Fraunhofer Society IAO Stuttgart.
McCall, J.A., Richards, P.K., Walters, G.F. (1977): Factors in Software Quality. Vol. 1: Concepts and Definitions of Software Quality. Springfield.
Mellish, C. (1989): Some chart-based techniques for parsing ill-formed input. In: Proc. 27th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, pp. 341ff.
Meya, M., Rodellar, X. (1992): Evaluation of the Spanish Spell Checker. Report of ESPRIT Project 2315, Translator's Workbench. CDS Barcelona.
Microsoft Corp. (1992): New directions for Microsoft. Do all roads lead to Redmond? In: Language Industry Monitor No.7, January - February 1992.
Okuda, T., Tanaka, E., Kasai, T. (1976): A Method for the Correction of Garbled Words Based on the Levenshtein Metric, IEEE Transactions on Computers, pp. 172-178, February 1976.
Patterson W., Urrutibeheity, H. (1975): The lexical structure of Spanish. The Hague: Mouton.
Partee, B.H., et al. (1990): Mathematical Methods in Linguistics. Dordrecht 1990.
Peterson, J.L. (1980): Computer Programs for Detecting and Correcting Spelling Errors, Communications of the ACM, pp. 676-687, December 1980.
Picht, H., Draskau, J., (1985): Terminology: An Introduction, Guildford: University of Surrey.
PODA (1988): (ESPRIT Project 1024), Stored ODA (SODA) Interface.
Rimon, M., Herz, J. (1991): The Recognition Capacity of Local Syntactic Constraints. EACL, pp. 155-160.
Romano, J. (1984): Un sistema automático de síntesis de habla mediante semisílabas. In: Boletín 2, Procesamiento del Lenguaje Natural. San Sebastián.
Saras, J.A., et al. (1988): CACTUS: Opening X.400 to the low cost PC world, in: Proc.
ESPRIT88, North-Holland.
Schmitt, H., (1989): Writing Understandable Technical Texts. Report of ESPRIT Project 2315, Translator's Workbench. Siemens Nixdorf, Munich.
Schnelle, H. (1991): Beurteilung von Sprachtechnologie und Sprachindustrie, Colloquium
Sprachtechnologie und Praxis der maschinellen Sprachverarbeitung, Reimers-Stiftung: Bad Homburg.
Schwanke, M. (1991): Maschinelle Übersetzung, ein Überblick über Theorie und Praxis. Berlin: Springer-Verlag.
Schwind, C. (1988): Sensitive parsing: error analysis and explanation in an intelligent language
tutoring system. In: Proc. 12th International Conference on Computational Linguistics (COLING
1988). Budapest, pp. 341ff.
Sinclair, J., ed., (1988): Collins Cobuild English Language Dictionary, London and Glasgow.
Taul, M., Alonso, J.A. (1990): Specification of String Syntax for Spanish. Report of ESPRIT Project 2315, Translator's Workbench. CDS Barcelona.
Thurmair, G. (1990a): Parsing for grammar and style checking. Proc. 13th International Conference on Computational Linguistics (COLING) 1990, Helsinki.
Thurmair, G. (1990b): Complex lexical transfer in METAL. Proc. 3rd Intern. Conf. on Theoretical and Methodological Issues in MT, Austin, TX.
Thurmair, G. (1990c): METAL, Computer integrated translation. In: J. McNaught, ed.: Proc. SALT Workshop 1990, Manchester.
Thurmair, G., (1990d): Style checking in TWB. Report of the ESPRIT Project 2315, Translator's Workbench. Siemens Nixdorf, Munich.
Thurmair, G., (1990e): Parsing for grammar and style checking. Proc. 13th International Conference on Computational Linguistics (COLING) Helsinki.
Thurmair, G., (1992): Verification of controlled grammars. Proc. Workshop Bad Homburg. To appear.
Turba, T.N. (1981): Checking for spelling and typographical errors in computer-based text, SIGPLAN Notices, pp. 298-312, June 1981.
Ullman, J.R. (1977): A binary n-gram technique for automating correction of substitution, deletion, insertion and reversal errors in words, The Computer Journal, pp. 141-147.
Weaver, W. (1955): Translation. In: W.N. Locke and A.D. Booth (eds.): Machine translation of languages. Cambridge, MA: MIT Press.
Weischedel, R.M., Sondheimer, N.K. (1983): Meta-rules as a basis for processing ill-formed input. In: Computational Linguistics 9/3-4 (1983), pp. 161ff.
Williams, Gr. (1987): HyperCard. BYTE 12, 109-117, 1987.
Winkelmann, G. (1990): Semiautomatic Interactive Multilingual Style Analysis. Proc. 13th International Conference on Computational Linguistics (COLING) 1990, Helsinki.
Voliotis, A., Zavras, A. (1990): Extending 4.3BSD to support Greek characters, Proceedings of the European Unix Users Group Conference in Nice, France, Fall 1990.
Yang, H., (1986): A new technique for identifying scientific/technical terms and describing science
texts. Literary and Linguistic Computing 1 (2), pp. 93 - 103.
Zimmermann, H. (1987): Textverarbeitung im Rahmen moderner Bürokommunikationstechniken. PIK 10, München, pp. 38-45.
Index of Authors
Ahmad 8, 59
Albl 6, 100
Davies 8, 59
Delgado 29
Dudda 112
Fulford 8, 59
Hellwig 128
Heyer 3, 24, 40
Hoge 4, 168
Hohmann 4, 168
Holmes 8, 59
Hoof 49, 83
Karamanlakis 145
Keck 83
Kese 112
Kleist 157
Kohn 6, 100
Kugler 83, 109, 112, 157
Le Hong 4, 168
Mayer 49, 83
Menzel 83
Meya 29, 110, 121
Rogers 16, 67
Stallwitz 157
Thurmair 16, 75, 117, 123
Waldhör 40, 157
Winkelmann 117
Zavras 145