Sense-annotating a Lexical Substitution Data Set with Ubyline

We describe the construction of GLASS, a newly sense-annotated version of the German lexical substitution data set used at the GermEval 2015: LexSub shared task. Using the two annotation layers, we conduct the first known empirical study of the relationship between manually applied word senses and lexical substitutions. We find that synonymy and hypernymy/hyponymy are the only semantic relations directly linking targets to their substitutes, and that substitutes in the target’s hypernymy/hyponymy taxonomy closely align with the synonyms of a single GermaNet synset. Despite this, these substitutes account for only a minority of those provided by the annotators. The results of our analysis accord with those of a previous study on English-language data (albeit one using automatically induced word senses), leading us to suspect that the sense–substitution relations we discovered may be of a universal nature. We also tentatively conclude that relatively cheap lexical substitution annotations can be used as a knowledge source for automatic word sense disambiguation (WSD). Also introduced in this paper is Ubyline, the web application used to produce the sense annotations. Ubyline presents an intuitive user interface optimized for annotating lexical sample data, and is readily adaptable to sense inventories other than GermaNet.
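
To make the relation analysis above concrete, the following is a minimal sketch of how a target–substitute pair can be classified as a synonymy or hypernymy/hyponymy match against a wordnet-style sense inventory. It is not the authors' code: the paper works with GermaNet, but since GermaNet is not freely redistributable, this sketch substitutes NLTK's English WordNet, and the relation_between() helper and the example lemmas are illustrative assumptions only.

```python
# Minimal sketch (assumed, not from the paper): classify a target-substitute
# pair by the semantic relations discussed above, using NLTK's English WordNet
# as a stand-in sense inventory. Requires: pip install nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def relation_between(target_synset, substitute_lemma):
    """Return 'synonym', 'hypernym', 'hyponym', or None for the given pair."""
    substitute_lemma = substitute_lemma.lower().replace(" ", "_")
    # Synonymy: the substitute appears in the target's own synset.
    if substitute_lemma in (l.lower() for l in target_synset.lemma_names()):
        return "synonym"
    # Hypernymy: the substitute labels a synset directly above the target.
    if any(substitute_lemma in (l.lower() for l in s.lemma_names())
           for s in target_synset.hypernyms()):
        return "hypernym"
    # Hyponymy: the substitute labels a synset directly below the target.
    if any(substitute_lemma in (l.lower() for l in s.lemma_names())
           for s in target_synset.hyponyms()):
        return "hyponym"
    return None

# Illustrative English pairs (expected outputs shown in comments):
print(relation_between(wn.synset("car.n.01"), "auto"))     # 'synonym'
print(relation_between(wn.synset("dog.n.01"), "canine"))   # 'hypernym'
print(relation_between(wn.synset("dog.n.01"), "puppy"))    # 'hyponym'
```

In the study itself, the target's synset is fixed by the manual GermaNet sense annotation, so a classification like this can be aggregated over all annotated target–substitute pairs to measure how many substitutes fall inside versus outside the target's synonymy and hypernymy/hyponymy neighbourhood.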
