WebCAGe – A Web-Harvested Corpus Annotated with GermaNet Senses

This paper describes an automatic method for creating a domain-independent sense-annotated corpus harvested from the web. As a proof of concept, this method has been applied to German, a language for which sense-annotated corpora are still in short supply. The sense inventory is taken from the German wordnet GermaNet. The web-harvesting relies on an existing mapping of GermaNet to the German version of the web-based dictionary Wiktionary. The data obtained by this method constitute WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses), a resource which currently represents the largest sense-annotated corpus available for German. While the present paper focuses on one particular language, the method as such is language-independent.
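The role of the GermaNet-to-Wiktionary mapping can be illustrated with a minimal sketch. It assumes the mapping is available as a lookup from a Wiktionary sense to a GermaNet sense identifier, and that Wiktionary supplies example sentences per sense. The function name, the data layout, the identifiers, and the prefix-based matching of inflected forms are all hypothetical simplifications for illustration, not the procedure used to build WebCAGe.

```python
import re
from dataclasses import dataclass


@dataclass
class AnnotatedSentence:
    sentence: str           # the harvested example sentence
    target: str             # the word form found in the sentence
    germanet_sense_id: str  # sense label carried over through the mapping


def transfer_senses(wiktionary_examples, sense_mapping):
    """Label example sentences with GermaNet senses via a sense mapping.

    Both arguments are keyed by (lemma, wiktionary_sense_number).
    This layout is an assumption made for the sketch.
    """
    annotated = []
    for (lemma, wikt_sense), sentences in wiktionary_examples.items():
        gn_id = sense_mapping.get((lemma, wikt_sense))
        if gn_id is None:
            continue  # no mapped GermaNet sense, so nothing to annotate
        # Crude approximation: accept any word form that starts with the lemma.
        pattern = re.compile(r"\b" + re.escape(lemma) + r"\w*", re.IGNORECASE)
        for sentence in sentences:
            match = pattern.search(sentence)
            if match:
                annotated.append(
                    AnnotatedSentence(sentence, match.group(0), gn_id)
                )
    return annotated


# Toy data; the sentences and identifiers are invented for this sketch.
EXAMPLES = {
    ("Bank", 1): ["Er saß auf einer Bank im Park."],
    ("Bank", 2): ["Sie hebt Geld bei der Bank ab."],
    ("Bank", 3): ["Die Bank gewinnt immer."],
}
MAPPING = {
    ("Bank", 1): "hypothetical-id-bench",
    ("Bank", 2): "hypothetical-id-institution",
}

for item in transfer_senses(EXAMPLES, MAPPING):
    print(item.germanet_sense_id, "|", item.target, "|", item.sentence)
```

Senses without a mapped GermaNet counterpart are skipped in this sketch rather than guessed, so only the first two toy senses produce annotated sentences.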
