An Automatic Method for Creating a Sense-Annotated Corpus Harvested from the Web

This paper reports on an automatic, language-independent method for compiling a sense-annotated corpus of web data. To validate its language independence, the method has been applied to English and German. The sense inventories are taken from the Princeton WordNet for English and from the German wordnet GermaNet. The web harvesting uses existing mappings of WordNet and GermaNet to the English and German versions of the web-based dictionary Wiktionary, respectively. Applying the method has produced two resources: the English WebCAP (short for Web-Harvested Corpus Annotated with Princeton WordNet Senses) and the German WebCAGe (short for Web-Harvested Corpus Annotated with GermaNet Senses).
