Detection of Annotation Errors in Corpora

This paper surveys methods for detecting and correcting annotation errors. The methods fall broadly into two groups: those that detect inconsistencies with respect to a statistical model based only on the corpus data, and those that detect inconsistencies with respect to a grammatical model or, more generally, some external information source. Two extended examples illustrate these techniques: (1) the variation n-gram method, which searches for inconsistencies in the annotation of identical strings; and (2) a method of ad hoc rule detection for syntactic annotation, which compares treebank rules to a grammar to determine which are anomalous. Methods for detecting annotation errors have developed considerably over the last decade. Corpus practitioners can benefit greatly from them, and NLP researchers can learn more about the nuances of the annotation they use and see how error correction methods intersect with NLP techniques.
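To make the idea of searching for inconsistent annotation of identical strings concrete, the following is a minimal sketch for part-of-speech tags. It is a simplification under stated assumptions, not the procedure from the paper: the function name `variation_ngrams`, the toy corpus, and the choice to report every n-gram position whose tag varies are all illustrative. A flagged entry is only a candidate error, since a word can be genuinely ambiguous even in identical surrounding words.

```python
from collections import defaultdict


def variation_ngrams(tagged_sentences, n=3):
    """Find word n-grams in which one position receives more than one tag.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns a dict mapping (word_ngram, position) to the set of tags seen
    at that position across all occurrences of the same word n-gram.
    Hypothetical helper for illustration only.
    """
    tags_seen = defaultdict(set)
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        tags = [tag for _, tag in sentence]
        for start in range(len(sentence) - n + 1):
            ngram = tuple(words[start:start + n])
            for position in range(n):
                tags_seen[(ngram, position)].add(tags[start + position])
    # Keep only positions whose tag varies while the words stay identical.
    return {key: found for key, found in tags_seen.items() if len(found) > 1}


# Toy corpus (invented for this sketch): "run" is tagged VB in one sentence
# and NN in another, although the surrounding words are the same.
corpus = [
    [("ready", "JJ"), ("to", "TO"), ("run", "VB"), ("again", "RB")],
    [("ready", "JJ"), ("to", "TO"), ("run", "NN"), ("again", "RB")],
    [("a", "DT"), ("run", "NN"), ("again", "RB")],
]

for (ngram, position), found in sorted(variation_ngrams(corpus).items()):
    print(" ".join(ngram), "| varying word:", ngram[position], "| tags:", sorted(found))
```

A larger `n` requires more identical context around the varying word, which in this sketch trades the number of flagged spots for a higher chance that each one reflects a real inconsistency.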
