They conclude that crowd-sourcing enables systematic, large-scale judgement studies that are more affordable and convenient than ... lab-based studies. A clear concern with eliciting such information from non-expert users is the correctness of the results. However, studies have found that by combining the annotations of multiple non-expert annotators, the resulting annotated data is comparable to the data provided by an expert annotator (see Snow et al., 2008; Hsueh et al., 2009).

One of the most prominent examples of such crowd-sourcing is the online encyclopedia Wikipedia and its dictionary version Wiktionary (http://www.wiktionary.org). There have already been several attempts to create a structured resource out of Wiktionary (Zesch et al., 2008a; McCrae et al., 2012b) and to apply the data contained within this resource to NLP tasks such as information retrieval (Müller and Gurevych, 2009) and semantic relatedness (Zesch et al., 2008b). As resources created by crowd-sourcing are not typically developed for machine processing, it is a priori not clear how useful such crowd-sourced resources are for NLP tasks. A topological comparative study of Wiktionary as a resource was carried out by Meyer and Gurevych (2010). They found that Wiktionary was similar in structure to resources created by experts (in this case GermaNet and OpenThesaurus), but had fewer semantic links per resource. They also reported many technical issues in Wiktionary concerning broken links and axiom violations, e.g., the indication of a synonymy between two pages in only one direction, which violates the symmetry axiom for synonymy. McCrae et al. (2012b) presented a study on integrating WordNet with Wiktionary, showing that there was only an approximately 25% overlap at the level of lexical entries between the two resources. This suggests that combining these resources may be very valuable.

2.2. Wiktionary

In this paper, we take Wiktionary as a representative example of a collaboratively edited dictionary. Wiktionary currently consists of 2.8 million entries (380,000 in English alone, based on the dump dated 08/10/2011). It would be very useful if Wiktionary could be used to support NLP applications. However, the MediaWiki markup provides only weak semantic information, as the main purpose of this markup is to display the entries in a uniform manner. Therefore, when attempting to automatically process the markup, a number of issues occur:

Implicit Semantics
- The markup is mainly used for display purposes, and linguistic knowledge is hidden behind procedures that render certain templates in an appropriate way. This linguistic knowledge, being implicit only, is difficult to access and exploit. This is the case for inflectional markup, e.g., plural formation. The markup often also has additional parameters whose semantics is not well-defined.

Lack of Consistency
- Markup is not used consistently, in the sense that tags, such as those for part-of-speech, are not used with the same meaning and for the same purpose across languages.
- There is no proper typing for the markup, so the same type of markup is used to mark very different linguistic properties of lexical entries, e.g., using the same type of tags to specify that a lexical entry is archaic (pragmatic knowledge related to the usage of that entry) or that a lexical entry is uncountable (a lexico-syntactic property).

Lack of Expressivity
- Senses are often employed in multiple roles in an entry, e.g., in giving definitions and in providing translations. However, there is no explicit ID assigned to these senses, so individual uses of a sense cannot be consolidated. For example, for the entry cat a definition of "Any similar animal of the family Felidae, which includes lions, tigers etc." is given, but a set of translations is given under the definition "member of Felidae" and a set of synonyms is given under the heading "any member of Felidae". While it is easy for a human to see that these elements are equivalent, it is a non-trivial task for a machine. An example of this is given in Figure 1.
- Links to other pages do not specify the particular entry or definition that is relevant. For example, the English bank links to the German Bank, but it is not specified whether the translation is the entry with plural Banken (which is correct) or the entry with plural Bänke (which is erroneous and means bench).

Technical Inconsistencies
- There are often small technical errors. For example, in certain places either ISO 639 codes or language names (in English) may be used to indicate the language of a translation.

The above issues reveal that we need a sound and linguistically motivated data model that solves some of these issues, in particular by introducing IDs for senses that can be referred to when specifying translations etc. We propose to use the lemon model for this, which is briefly described in the next section.

2.3. The lemon model

The lemon model (McCrae et al., 2012a) is a proposed model for the representation of ontology-lexica as linked data. The lemon model builds on a number of existing standards for the representation of lexica, such as the Lexical Markup Framework (Francopoulo et al., 2006), and on the SKOS model (Miles and Bechhofer, 2009) for representing terminologies. With lemon we had several key design goals: Firstly, the model is based on RDF, as this is the standard method for distributing linked data on the web.
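To make the sense-consolidation point concrete, the following is a minimal sketch of ours (not taken from the paper) of a lemon-style entry for cat with an explicit sense ID, written in Python with rdflib. The class and property names (LexicalEntry, sense, LexicalSense, reference, canonicalForm, writtenRep) follow the published lemon vocabulary, but the lexicon base URI and the DBpedia reference are illustrative assumptions. Because the sense has its own URI, the definition, translations and synonyms can all point to one object instead of repeating near-identical glosses:

```python
# Hedged sketch (ours): an explicit sense ID in a lemon-style RDF lexicon.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

LEMON = Namespace("http://lemon-model.net/lemon#")   # lemon core vocabulary
LEX = Namespace("http://www.example.org/lexicon/")    # hypothetical lexicon base URI

g = Graph()
entry = LEX["cat"]
sense = LEX["cat_sense2"]  # explicit, citable sense ID

g.add((entry, RDF.type, LEMON.LexicalEntry))
g.add((entry, LEMON.canonicalForm, LEX["cat_form"]))
g.add((LEX["cat_form"], LEMON.writtenRep, Literal("cat", lang="en")))

# One sense node; translations and synonyms elsewhere in the lexicon can
# refer to this URI instead of repeating the gloss in slightly varied forms.
g.add((entry, LEMON.sense, sense))
g.add((sense, RDF.type, LEMON.LexicalSense))
g.add((sense, LEMON.reference, URIRef("http://dbpedia.org/resource/Felidae")))

print(g.serialize(format="turtle"))
```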
Figure 1: An example of a Wiktionary entry. The highlighted elements indicate semantically related elements that use different markup and labelling. Some content has been removed from this page for illustrative purposes.
needs to be introduced by assigning a unique ID to the corresponding object. Displaying the associated ID (for example, bank_sense3) would not be meaningful to the user, who would clearly prefer a human-readable version of the object.

- Modelling artifacts require special rendering: For the representation of certain elements of the model, specialised data structures are required. For example, in the case of RDF, a linked list is required to provide ordered data (see the sketch after this list). Domain-specific data structures (e.g., trees or ordered sequences) need to be rendered intuitively by the editor, but this cannot be done in a generic manner.

- Consistency of logical units must be maintained: Some elements in a lexicon should be created and manipulated as a single unit even though they correspond to multiple elements in the data model. For example, subcategorization frames should be created with an appropriate argument structure and given a semantic mapping from the entry's sense.
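As an illustration of the rendering point above, the following sketch of ours (using rdflib; the frame and property URIs are hypothetical) builds the kind of RDF linked list that an editor must present to the user as a plain ordered sequence:

```python
# Hedged sketch (ours): RDF has no native ordered type, so ordered data
# (e.g. the arguments of a frame) is encoded as a chain of rdf:first/rdf:rest
# nodes, which a generic editor would render very unintuitively.
from rdflib import BNode, Graph, Namespace
from rdflib.collection import Collection

EX = Namespace("http://www.example.org/lexicon/")  # hypothetical base URI

g = Graph()
head = BNode()
# Build the list ( ex:arg1 ex:arg2 ex:arg3 ) as an rdf:List.
Collection(g, head, [EX.arg1, EX.arg2, EX.arg3])
g.add((EX.frame1, EX.hasArguments, head))  # ex:hasArguments is hypothetical

print(g.serialize(format="turtle"))  # shows the rdf:first/rdf:rest chain
```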
In addition, as with other lexicon models such as LMF (Francopoulo et al., 2006), there is much specialised terminology that is not clear to those without expertise in lexicography or familiarity with the model. For this reason, it is important to provide built-in help that explains the terminology and makes it accessible to naive users.

For the above reasons, we decided to create a new application from scratch that provides a clean and intuitive user interface to lemon, to be used by non-experts to create correct lexica. In addition, we defined the following technical requirements:

- There should be help throughout the system, so that the definitions of concepts can be provided to non-expert users.
- There should be support for private working spaces for lexica and for tracking the changes and status of the lexica.
- The re-use of data categories such as ISOcat (Kemps-Snijders et al., 2008) should be fostered.
- The system should have a model-view-controller architecture, where the model is the RDF store containing the data.
- The data should be accessible by linked data principles. In particular, RDF data should be available by means of transparent content negotiation (Holtman and Mutz, 1998), and a SPARQL endpoint should be available (a client-side sketch follows this list).
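As a client-side illustration of this last requirement, the following sketch of ours (the entry URI is hypothetical, and this is not code from the paper) requests RDF rather than HTML for a lexical entry via the HTTP Accept header:

```python
# Hedged sketch (ours): retrieving RDF for an entry via content negotiation.
import urllib.request

req = urllib.request.Request(
    "http://www.example.org/lexicon/cat",   # hypothetical entry URI
    headers={"Accept": "text/turtle"},       # ask for RDF rather than HTML
)
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Content-Type"))  # e.g. text/turtle
    print(resp.read().decode("utf-8")[:500])
```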
Our system is designed following the model-view-controller pattern, where the model is stored in a triple repository. Each modification to the lexicon in the UI is automatically mapped to corresponding changes in the backend, which is implemented by a Virtuoso repository (http://virtuoso.openlinksw.com/). For the UI we rely on the jQuery library (http://jquery.com/), which supports the easy creation of AJAX applications. We employed an extension that allows a help message to be shown for the currently selected element. We also implemented a journalling repository mechanism that allows changes to be tracked and logged together with the user that made the modification. User management itself is handled via OpenID, allowing users to use accounts from providers such as Google and Yahoo!. Finally, we implemented a linguistic data category ontology interface that uses the LexInfo model (http://www.lexinfo.net/ontology/2.0/lexinfo), which is itself derived from the ISOcat registry (Kemps-Snijders et al., 2008) and is further described in McCrae et al. (2011).
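The journalling mechanism can be pictured with the following minimal sketch of ours (not the actual implementation, which sits on top of the Virtuoso repository): a thin wrapper that records the author and timestamp of every triple-level change, so the change history of an entry can be shown to users:

```python
# Hedged sketch (ours): a journalling wrapper around an rdflib Graph that
# records who changed which triple and when.
from datetime import datetime, timezone
from rdflib import Graph

class JournallingGraph:
    def __init__(self):
        self.graph = Graph()
        self.journal = []  # entries of the form (timestamp, user, action, triple)

    def _log(self, user, action, triple):
        self.journal.append((datetime.now(timezone.utc), user, action, triple))

    def add(self, triple, user):
        self.graph.add(triple)
        self._log(user, "add", triple)

    def remove(self, triple, user):
        self.graph.remove(triple)
        self._log(user, "remove", triple)

    def history(self, subject):
        """Change history for one entry (all triples with this subject), newest first."""
        return sorted((e for e in self.journal if e[3][0] == subject),
                      key=lambda e: e[0], reverse=True)
```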
3.2. Automatic lexicon creation

The lemon source application supports automatic lexicon creation by employing existing NLP components such as part-of-speech taggers, parsers and tokenizers. This facility, described in McCrae et al. (2011), allows a lexicon to be created automatically from an input ontology or Linked Data resource. Naturally, the system only allows automatically generated entries to be added to users' private lexica, as the result may contain errors and as such requires manual review. The system is implemented by means of a blackboard architecture, so that each step of the process is implemented independently, allowing new tasks to be added with ease. Currently, the following processing steps are applied (the first two are sketched in code after the list):

- Label Extraction: The goal of this pre-processing step is to yield a human-readable label for each ontological element in the ontology or Linked Data resource. Ontologies on the Web and linked data resources differ in how they express lexical information about resources. Some use the rdfs:label property, others use foaf:name or even other proprietary properties. Thus, specialized procedures are required for label extraction. As ontologies often also lack language information, we additionally employ a language identification approach, as well as techniques that extract labels from URIs, in order to identify an appropriate label for each resource.

- Tokenization: Many labels consist of multiple words that, however, are not separated by blanks. Thus, special heuristics are needed to tokenize labels.

- Parsing: If the label consists of multiple words, we apply a part-of-speech tagger and a parser, if one is available for the language in question. Otherwise, this step is skipped.

- Tagging: If a parser is not available, a part-of-speech tagger is applied to infer the part-of-speech of the component words.

- Merging: The generated entries are compared against the entries in legacy resources such as WordNet or Wiktionary in order to find duplicates. If duplicates are found, the entries are replaced by a URI representing the lexical entry in the legacy resource.

- Morphological Analysis: Based on the part-of-speech and canonical form of the entry, a morphological pattern is applied from a pre-loaded set of morphological patterns.

- Categorization: In this step, a set of specialized rules is applied to the parse tree in order to extract the subcategorization frame of the entry as well as its head.
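The following sketch of ours shows the gist of the first two steps using rdflib; the property fallback order and the tokenization heuristic are illustrative assumptions, not the paper's actual procedures:

```python
# Hedged sketch (ours): extract a human-readable label for a resource,
# preferring rdfs:label, then foaf:name, then the URI local name; then
# tokenize run-together labels such as "hasPartOf".
import re
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF, RDFS

def extract_label(graph: Graph, resource: URIRef) -> str:
    for prop in (RDFS.label, FOAF.name):
        label = graph.value(resource, prop)
        if label is not None:
            return str(label)
    # Fall back to the URI local name (the segment after the last '#' or '/').
    return re.split(r"[#/]", str(resource))[-1]

def tokenize(label: str) -> list[str]:
    # Split camelCase, underscores and hyphens: "hasPartOf" -> has, Part, Of.
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+",
                      label.replace("_", " ").replace("-", " "))
```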
Figure 3: An example screenshot of lemon source.

3.3. Collaborative editing

In this section we present the application we have created to support the creation and collaborative editing of the lexicon associated with ontologies, as well as its publication as linked data. As can be seen in Figure 3, the first step consists in importing an ontology or linked data resource. Then, the application offers the possibility to automatically create a preliminary version of the lexicon based on the natural language information associated with the ontology elements. After this initial step, users can already navigate through the automatically derived lexicon and edit it directly.

As the users of the system are intended to be a mixture of people with different degrees of linguistic knowledge, we require that the system provides a number of features for collaborative editing, based on those found in classical data wikis as well as on standard practice in the language resource community. As such, we formulate the following requirements:

- Support for monitoring changes to a page, by means of a change tracker on the RDF model that monitors any changes to an entry and displays them to the user.
- An area for each entry where users can make comments and discuss any details of an entry, should there be disagreement.
- Private working spaces for lexica, where one or several users may create their own lexica and manage the status of that data before deciding to publish it.
- Statuses for each entry that can be assigned by its editors, such as Rejected, Accepted, For review or Automatically generated. These statuses are displayed not only for each entry but also in a summary of all statuses in the lexicon.
- It must be possible to define groups of users who can work on a particular private lexicon, and each lexicon should have an owner.
- The owner must be able to publish a lexicon when it has reached a status where it can be published on the Web.

In order to meet these requirements, each entry can be assigned one of the statuses listed above. A manager can then assign roles to the different editors so that they can collaboratively edit the lexica. Once the set of data has reached the Accepted status, it can be considered ready for publication.
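A minimal sketch of ours of the resulting workflow check (the helper names are hypothetical; the status values are the ones listed above):

```python
# Hedged sketch (ours): entry statuses and the publication readiness check.
from enum import Enum

class Status(Enum):
    AUTOMATICALLY_GENERATED = "Automatically generated"
    FOR_REVIEW = "For review"
    ACCEPTED = "Accepted"
    REJECTED = "Rejected"

def publishable(entry_statuses):
    """A lexicon is ready for publication once every entry has been Accepted."""
    return all(s is Status.ACCEPTED for s in entry_statuses)
```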
4. Evaluation

For the evaluation of the lemon source editor, we focused on the usability and the coverage of the model and the tool. On the one hand, our objective was to find out how far the lemon model is capable of representing the lexical data contained in a resource such as Wiktionary, and whether the model matches the requirements that users have for their applications. On the other hand, our purpose was to learn users' opinions on how usable the system is, whether the resulting lexicon is as intended, whether they easily understand the lexical information captured in the model, whether they find it easy to edit, and whether the collaborative functionality helps them to create lexica in an intuitive manner. For these purposes, we conducted an initial set of evaluations with five Masters students: three studying Computer Science, one studying Linguistics and one studying Cognitive Science. They were given a short explanation of the system and allowed to work with the system for about an hour. They were given the task of representing a single entry from Wiktionary for a common term (hence an entry with much information) within lemon source. Afterwards, they were asked to answer a questionnaire with ten questions as follows (partially abridged):

1. Did you find the system easy to use?
2. Was the lexical information presented easy to understand?
3. Were you able to represent all information you required?
4. Was the built-in help functionality adequate?
5. Was it straightforward to learn how to use the system?
6. Were the user interface elements clear and understandable?
7. Was there too much to read before you could start using the system?
8. Was the resulting lexicon as intended?
9. Did straightforward tasks (such as creating a lexical entry with an associated subcategorization frame) require too many steps?
10. Did the collaborative functionality help?

The five users were given time to work with the system and then asked to complete the questionnaire. The results are presented in Figure 4 and show that, out of a total of 50 responses, 34 answers were positive, 6 were negative, 9 were mixed (users who answered "Mostly, but..." or who expressed negativity about a small part of the system) and 1 was not applicable (the user did not use the collaborative tools). In general, the results were positive, and the users were mostly satisfied with the layout of the system, in part due to its similarity to existing wiki platforms. Most of the negative comments referred to bugs in the system, for example errors in handling strings with apostrophes, which are easy to solve. One particular concern (mentioned especially in response to Q2 and Q3) was that finding particular linguistic properties or categories was difficult; one user noted that he/she could not model a verb form as a participle, which can be modelled via the property "verb form mood". We therefore intend to introduce a search function for linguistic properties to enable users to find the correct modelling. One of the Computer Science users found the system very difficult to use, as he was unfamiliar with most of the linguistic terminology, such as "homonym" and "phrase tree". In response to the criticism that the system lacks documentation, in particular as far as the description of linguistic categories is concerned, we intend to extend the written documentation of the interface and to create a video introduction on the main page.

Figure 4: Results of the usability study by question (proportions of Positive, Mixed, Negative and N/A responses for Q1-Q10).

5. Conclusion

We presented a system that is intended for the collaborative creation of linked data lexica using the lemon model. This system takes inspiration from the classical document-based wiki approach but extends it with a structured and linguistically sound data model. We argue that, on the one hand, the data created by classical wikis lacks sufficient semantics to be useful for many text processing applications, and that, on the other hand, generic data-driven editors would be difficult to use for non-expert users. Therefore, we argue that for complex language resources, such as lexica, it is necessary to create custom user interfaces that support the creation of high-quality ontology-lexica. The system we have presented in this paper, lemon source, is a web-based tool that allows users, both experts and non-experts, to collaboratively create an ontology-lexicon semi-automatically, starting from an automatically created lexicon. An evaluation of the system has shown that the system is usable, but has also revealed that appropriate documentation is a key issue that needs to be addressed for the system to be improved.

Acknowledgements

This work was developed in the context of the Monnet project, which is funded by the European Union FP7 programme under grant number 248458, and by the CITEC excellence initiative funded by the DFG (Deutsche Forschungsgemeinschaft).

6. References

S. Auer, S. Dietzold, and T. Riechert. 2006. OntoWiki - A tool for social, semantic collaboration. In Proceedings of the 5th International Semantic Web Conference (ISWC 2006), pages 736-749. Springer.

T. Berners-Lee. 2009. Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1-22.

S. Bird and M. Liberman. 2001. A formal framework for linguistic annotation. Speech Communication, 33(1):23-60.

P. Buitelaar. 2010. Ontology-based Semantic Lexicons: Mapping between Terms and Object Descriptions. In Chu-Ren Huang, Nicoletta Calzolari, Aldo Gangemi, Alessandro Oltramari, Alessandro Lenci, and Laurent Prévot, editors, Ontology and the Lexicon: A Natural Language Processing Perspective. Cambridge University Press.

P. Cimiano, P. Buitelaar, J. McCrae, and M. Sintek. 2011. LexInfo: A Declarative Model for the Lexicon-Ontology Interface. Web Semantics: Science, Services and Agents on the World Wide Web, 9(1).

H. Cunningham, V. Tablan, K. Bontcheva, M. Dimitrov, and Ontotext Lab. 2003. Language Engineering Tools for Collaborative Corpus Annotation. In Proceedings of Corpus Linguistics 2003, pages 80-87.

G. De Melo and G. Weikum. 2008. Language as a foundation of the semantic web. In Proceedings of the Poster and Demonstration Session at the 7th International Semantic Web Conference (ISWC).

C. Draxler. 2006. Web-based speech data collection and annotation. In Proceedings of Speech and Computer (SPECOM 2006), pages 27-34.
G. Francopoulo, M. George, N. Calzolari, M. Monachini, N. Bel, M. Pet, and C. Soria. 2006. Lexical Markup Framework (LMF). In Proceedings of the International Conference on Language Resources and Evaluation (LREC'06).

K. Holtman and A. Mutz. 1998. Transparent Content Negotiation in HTTP. RFC 2295 (Experimental), March.

P.Y. Hsueh, P. Melville, and V. Sindhwani. 2009. Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pages 28-35.

N. Ide and K. Suderman. 2007. GrAF: A graph-based format for linguistic annotations. In Proceedings of the Linguistic Annotation Workshop, pages 1-8. Association for Computational Linguistics.

M. Kemps-Snijders, M. Windhouwer, P. Wittenburg, and S.E. Wright. 2008. ISOcat: Corralling data categories in the wild. In Proceedings of the International Conference on Language Resources and Evaluation (LREC'08).

J. McCrae, D. Spohr, and P. Cimiano. 2011. Linking lexical resources and ontologies on the semantic web with lemon. The Semantic Web: Research and Applications, pages 245-259.

J. McCrae, G. Aguado-de-Cea, P. Buitelaar, P. Cimiano, T. Declerck, A. Gómez-Pérez, J. Gracia, L. Hollink, E. Montiel-Ponsoda, D. Spohr, and T. Wunner. 2012a. Interchanging lexical resources on the Semantic Web. Language Resources and Evaluation.

J. McCrae, E. Montiel-Ponsoda, and P. Cimiano. 2012b. Integrating WordNet and Wiktionary with lemon. In C. Chiarcos, S. Nordhoff, and S. Hellmann, editors, Linked Data in Linguistics. Representing and Connecting Language Data and Language Metadata, pages 25-34.

C. Meyer and I. Gurevych. 2010. Worth its weight in gold or yet another resource - a comparative study of Wiktionary, OpenThesaurus and GermaNet. Computational Linguistics and Intelligent Text Processing, pages 38-49.

A. Miles and S. Bechhofer. 2009. SKOS Simple Knowledge Organization System Reference. W3C Recommendation. Technical report, World Wide Web Consortium.

C. Müller and I. Gurevych. 2009. Using Wikipedia and Wiktionary in domain-specific information retrieval. Evaluating Systems for Multilingual and Multimodal Information Access, pages 219-226.

R. Munro, S. Bethard, V. Kuperman, V.T. Lai, R. Melnick, C. Potts, T. Schnoebelen, and H. Tily. 2010. Crowdsourcing and language studies: the new generation of linguistic data. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 122-130.

R. Snow, B. O'Connor, D. Jurafsky, and A. Ng. 2008. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254-263.

M. Van Assem, A. Gangemi, and G. Schreiber. 2006. Conversion of WordNet to a standard RDF/OWL representation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06).

T. Zesch, C. Müller, and I. Gurevych. 2008a. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the Conference on Language Resources and Evaluation (LREC), pages 1646-1652.

T. Zesch, C. Müller, and I. Gurevych. 2008b. Using Wiktionary for computing semantic relatedness. In Proceedings of AAAI, volume 2008, page 45.