Increasing the Recall of Corpus Annotation Error Detection

While error detection approaches have been developed for various types of corpus annotation, so far only limited attention has been paid to the recall of those methods. We show how the recall of the so-called variation n-gram method can be increased by examining comparable part-of-speech tag sequences instead of the recurring strings themselves. To guide the search for erroneous annotation and to distinguish errors with high precision, we also develop new context reliability indicators.
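The shift from recurring strings to comparable tag sequences can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a corpus given as lists of (word, tag) pairs and a context of one token on each side, and the names `variation_contexts`, `by_word`, and `by_tag` are hypothetical. A word is flagged when it carries different tags in contexts that count as the same, and the only change between the two settings is whether neighbouring tokens are compared by their words or by their tags.

```python
from collections import defaultdict


def variation_contexts(sentences, context_key):
    """Find words that carry more than one tag in the same context.

    sentences: list of sentences, each a list of (word, tag) pairs.
    context_key: function mapping a neighbouring (word, tag) pair to the
        value used to compare contexts (its word, or its tag).
    Returns a dict from (left context, word, right context) to the set of
    tags observed there, keeping only entries with more than one tag.
    """
    seen = defaultdict(set)
    for sent in sentences:
        # Pad so that sentence-initial and sentence-final tokens have a context.
        padded = [("<s>", "<s>")] + list(sent) + [("</s>", "</s>")]
        for i in range(1, len(padded) - 1):
            word, tag = padded[i]
            key = (context_key(padded[i - 1]), word, context_key(padded[i + 1]))
            seen[key].add(tag)
    return {key: tags for key, tags in seen.items() if len(tags) > 1}


def by_word(token):
    """Compare contexts by the surrounding words (recurring strings)."""
    return token[0]


def by_tag(token):
    """Compare contexts by the surrounding part-of-speech tags."""
    return token[1]


# Toy corpus (invented for illustration): "right" is tagged NN once and JJ once.
corpus = [
    [("the", "DT"), ("right", "NN"), ("to", "TO"), ("vote", "VB")],
    [("a", "DT"), ("right", "JJ"), ("to", "TO"), ("appeal", "VB")],
]

word_hits = variation_contexts(corpus, by_word)  # contexts differ: "the" vs "a"
tag_hits = variation_contexts(corpus, by_tag)    # contexts match: DT _ TO
```

In this toy corpus the word-based comparison finds nothing, because "the right to" and "a right to" are different strings, while the tag-based comparison groups both occurrences under DT _ TO and reports the NN/JJ variation. The wider net also catches genuine ambiguities, which is why the abstract pairs the generalization with context reliability indicators.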
