Error detection and correction in annotated corpora

Building on work showing that annotation errors harm both the training and the evaluation of natural language processing technologies, this thesis develops a method for detecting and correcting errors in corpora with linguistic annotation. The so-called variation n-gram method relies on the recurrence of identical strings with varying annotation to find erroneous mark-up. We show that the method applies to annotation of varying complexity. It is most readily applied to positional annotation, such as part-of-speech annotation, but can be extended to structural annotation, both for tree structures (as with syntactic annotation) and for graph structures (as with syntactic annotation that allows discontinuous constituents, or crossing branches). Furthermore, we demonstrate that the notion of variation is a powerful one for detecting errors, by searching a treebank for grammar rules which have the same daughters but different mothers. We also show that such errors impact the effectiveness of a grammar induction algorithm and subsequent parsing. After detecting errors in the different corpora, we turn to correcting them through the use of more general classification techniques. Our results indicate that the particular classification algorithm is less important than understanding the nature of the errors and altering the classifiers to deal with them. With such alterations, we can automatically correct errors with 85% accuracy. By sorting the errors, we can
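
The core idea of the abstract, identical strings recurring with varying annotation, can be illustrated with a minimal sketch for part-of-speech annotation. This is a simplified reading of that one sentence, not the thesis implementation: the function name `variation_ngrams`, the fixed n-gram length, the toy corpus, and its tags are all assumptions made for illustration.

```python
from collections import defaultdict


def variation_ngrams(tagged_sentences, n=3):
    """Return word n-grams whose tag at some position varies across the corpus.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Result: dict mapping (word n-gram, position in n-gram) -> set of tags seen,
    restricted to entries with more than one tag.
    Hypothetical helper, not taken from the thesis.
    """
    seen = defaultdict(set)
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        tags = [tag for _, tag in sentence]
        for start in range(len(sentence) - n + 1):
            ngram = tuple(words[start:start + n])
            # Record the tag observed at every position of this word n-gram.
            for offset in range(n):
                seen[(ngram, offset)].add(tags[start + offset])
    # Keep only positions where the same string carries different tags.
    return {key: tags for key, tags in seen.items() if len(tags) > 1}


# Toy corpus (assumed data): "off" is tagged differently in the same context.
corpus = [
    [("to", "TO"), ("ward", "VB"), ("off", "RP"), ("a", "DT"), ("takeover", "NN")],
    [("to", "TO"), ("ward", "VB"), ("off", "IN"), ("a", "DT"), ("bid", "NN")],
]
result = variation_ngrams(corpus, n=3)
```

Each entry in `result` is only a candidate: a word that is genuinely ambiguous can legitimately vary, so a flagged position still needs inspection. Requiring identical surrounding words, as the n-gram does here, is what makes a tagging difference more likely to be an error than an ambiguity.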
