Problems in Evaluating Grammatical Error Detection Systems

Many evaluation issues for grammatical error detection have previously been overlooked, making it hard to draw meaningful comparisons between different approaches, even when they are evaluated on the same corpus. To begin with, the three-way contingency between a writer’s sentence, the annotator’s correction, and the system’s output makes evaluation more complex than in some other NLP tasks; we address this by presenting an intuitive evaluation scheme. Of particular importance to error detection is the skew of the data (the low frequency of errors as compared to non-errors), which distorts some traditional measures of performance and limits their usefulness. We therefore recommend reporting the raw measurements: true positives, false negatives, false positives, and true negatives. Other issues that are particularly vexing for error detection concern how these raw measurements are defined: specifying the size or scope of an error, properly treating errors as graded rather than discrete phenomena, and counting non-errors. We discuss recommendations for best practices in reporting the results of system evaluation for these cases, recommendations which depend upon making clear one’s assumptions and applications for error detection. Highlighting the problems with current error detection evaluation should leave the field better able to move forward.
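To make the point about skew concrete, the following is a minimal sketch in Python of how accuracy can mislead when errors are rare, and why the raw counts are worth reporting. The corpus size, the error rate, and both systems are hypothetical numbers chosen for illustration; the class and function names are assumptions of this sketch and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Raw measurements for one system on one corpus (hypothetical helper)."""
    tp: int  # errors the system flagged
    fp: int  # non-errors the system flagged
    fn: int  # errors the system missed
    tn: int  # non-errors the system left alone

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        errors = self.tp + self.fn
        return self.tp / errors if errors else 0.0

    def f1(self) -> float:
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Hypothetical corpus: 1000 judged positions, of which 50 are errors.
# System A never flags anything; system B flags 70 positions, 30 correctly.
never_flags = Counts(tp=0, fp=0, fn=50, tn=950)
flags_some = Counts(tp=30, fp=40, fn=20, tn=910)

for name, c in [("never_flags", never_flags), ("flags_some", flags_some)]:
    print(
        f"{name}: TP={c.tp} FP={c.fp} FN={c.fn} TN={c.tn} "
        f"acc={c.accuracy():.3f} P={c.precision():.3f} "
        f"R={c.recall():.3f} F1={c.f1():.3f}"
    )
```

In this made-up setting the system that detects nothing scores a higher accuracy than the one that finds 30 of the 50 errors, a distortion that is visible only once the four raw counts are reported alongside the derived figures. Note that the TN count itself depends on how non-errors are counted, which is one of the definitional issues raised above.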
