Treebanks Gone Bad: Generating a Treebank of Ungrammatical English

This paper describes how a treebank of ungrammatical sentences can be created from a treebank of well-formed sentences. The procedure automatically introduces frequently occurring grammatical errors into the sentences of an existing treebank, and minimally transforms the treebank analyses so that they describe the newly created ill-formed sentences. Such a treebank can be used to test how well a parser is able to ignore grammatical errors in texts (as people can), and to induce a grammar capable of analysing such sentences. This paper also demonstrates the first of these uses.
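To make the idea concrete, the following is a minimal sketch of one possible error-introduction step. It assumes Penn Treebank-style bracketed trees and illustrates a single error type, a missing word. The function names (`parse`, `leaves`, `delete_leaf`, `to_string`) and the rule of pruning any constituent left empty are illustrative assumptions, not the procedure used in the paper.

```python
from typing import List, Optional, Union

# A leaf is a str (a word); an internal node is [label, child, child, ...].
Tree = Union[str, list]


def parse(bracketed: str) -> Tree:
    """Parse a Penn-style bracketed string into nested lists."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok              # a word
        node = [tokens[pos]]        # the constituent label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                    # consume ")"
        return node

    return read()


def leaves(tree: Tree) -> List[str]:
    """Return the words of the tree in sentence order."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def delete_leaf(tree: Tree, index: int) -> Optional[Tree]:
    """Return a copy of the tree without the index-th word (0-based).

    Assumption: a constituent left with no children is pruned, so the
    rest of the analysis is changed as little as possible.
    """
    seen = [0]  # running word counter shared by the recursive calls

    def walk(node: Tree) -> Optional[Tree]:
        if isinstance(node, str):
            keep = seen[0] != index
            seen[0] += 1
            return node if keep else None
        kids = [k for k in (walk(c) for c in node[1:]) if k is not None]
        return [node[0]] + kids if kids else None

    return walk(tree)


def to_string(tree: Tree) -> str:
    """Serialise nested lists back to bracketed notation."""
    if isinstance(tree, str):
        return tree
    return "(" + tree[0] + " " + " ".join(to_string(c) for c in tree[1:]) + ")"


well_formed = parse("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
ill_formed = delete_leaf(well_formed, 0)   # drop the determiner "the"
print(" ".join(leaves(ill_formed)))        # dog barked
print(to_string(ill_formed))               # (S (NP (NN dog)) (VP (VBD barked)))
```

In this sketch the noun phrase keeps its label and position after the determiner is removed, so the analysis of the ill-formed sentence differs from the original only where the error was introduced.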
