Detecting annotation errors in spoken language corpora

Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recent work has also started to address errors in syntactic and other structural annotation (Dickinson and Meurers, 2003b, 2005; Ule and Simov, 2004; Dickinson, 2005). Spoken language differs in many respects from written language, but to the best of our knowledge the detection of errors in the annotation of spoken language corpora has not yet been systematically addressed. This gap is significant because spoken data is increasingly relevant for linguistic and computational research, and such corpora are becoming more readily available, as illustrated by the holdings of the Linguistic Data Consortium (http://www.ldc.upenn.edu). This paper addresses the issue, building on the variation n-gram error detection approach developed in Dickinson and Meurers (2003a). We use the German Verbmobil treebank (Hinrichs et al., 2000) as an exemplar of a spoken language corpus and discuss the properties of such corpora that are relevant when adapting the variation n-gram approach to detect errors in their syntactic annotation.
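As a rough illustration of the idea behind variation n-grams, the sketch below flags word n-grams that recur in a corpus with different tags on the same token. It is a simplified, fixed-length version for positional (part-of-speech) annotation only, not the procedure of Dickinson and Meurers (2003a) and not the syntactic adaptation discussed in this paper. The function name `variation_ngrams`, the corpus format, and the toy English data are assumptions made for the example.

```python
from collections import defaultdict


def variation_ngrams(corpus, n=3):
    """Return n-grams whose token at some position carries more than one tag.

    corpus: list of sentences, each a list of (word, tag) pairs
            (hypothetical input format chosen for this sketch).
    Result: dict mapping (word n-gram, nucleus offset) -> set of tags seen.
    """
    # For every word n-gram and every position inside it, collect the tags
    # that the token at that position received across the corpus.
    seen = defaultdict(set)
    for sent in corpus:
        words = [w for w, _ in sent]
        for start in range(len(sent) - n + 1):
            gram = tuple(words[start:start + n])
            for offset in range(n):
                seen[(gram, offset)].add(sent[start + offset][1])
    # Keep only positions annotated in more than one way: identical context,
    # differing annotation, hence a candidate inconsistency.
    return {key: tags for key, tags in seen.items() if len(tags) > 1}


# Toy data (invented): "run" is tagged VB once and NN once in the same context.
toy_corpus = [
    [("to", "TO"), ("run", "VB"), ("fast", "RB")],
    [("to", "TO"), ("run", "NN"), ("fast", "RB")],
]
candidates = variation_ngrams(toy_corpus, n=3)
```

Each entry in `candidates` is only a candidate: genuine ambiguity can also produce variation, so longer shared contexts are generally taken as stronger evidence of an annotation error.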

[1] Geoffrey Nunberg et al. The linguistics of punctuation, 1990.

[2] E. Hinrichs et al. The Tübingen Treebanks for Spoken German, English, and Japanese, 2000.

[3] Walt Detmar Meurers et al. Detecting Errors in Discontinuous Structural Annotation, 2005, ACL.

[4] Hans van Halteren et al. The Detection of Inconsistency in Manually Tagged Text, 2000, COLING 2000.

[5] Treebank Penn et al. Linguistic Data Consortium, 1999.

[6] David R. Traum et al. Utterance Units in Spoken Dialogue, 1996, ECAI Workshop on Dialogue Processing in Spoken Language Systems.

[7] Karel Oliva et al. Achieving an Almost Correct PoS-Tagged Corpus, 2002, TSD.

[8] Walt Detmar Meurers et al. Detecting Inconsistencies in Treebanks, 2003.

[9] Walt Detmar Meurers et al. Detecting Errors in Part-of-Speech Annotation, 2003, EACL.

[10] Beatrice Santorini et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[11] Markus Dickinson et al. Error detection and correction in annotated corpora, 2005.

[12] Walt Detmar Meurers et al. On the use of electronic corpora for theoretical linguistics, 2005, Lingua.

[13] Susan Brennan et al. Processes that shape conversation and their implications for computational linguistics, 2000, ACL 2000.

[14] Lluís Padró et al. On the Evaluation and Comparison of Taggers: the Effect of Noise in Testing Corpora, 1998, COLING-ACL.

[15] Walter Daelemans et al. Improving Accuracy in word class tagging through the Combination of Machine Learning Systems, 2001, CL.

[16] Tylman Ule et al. Unexpected Productions May Well be Errors, 2004, LREC.

[17] Sabine Brants et al. The TIGER Treebank, 2001.

[18] Christine Thielen et al. Ein kleines und erweitertes Tagset fürs Deutsche, 1996.