Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality

The field of grammatical error correction (GEC) has grown substantially in recent years, with research directed at both evaluation metrics and improved system performance against those metrics. One unexamined assumption, however, is the reliance of GEC evaluation on error-coded corpora, which contain specific labeled corrections. We examine current practices and show that GEC’s reliance on such corpora unnaturally constrains annotation and automatic evaluation, resulting in (a) sentences that do not sound acceptable to native speakers and (b) system rankings that do not correlate with human judgments. In light of this, we propose an alternative approach that jettisons costly error coding in favor of unannotated, whole-sentence rewrites. We compare the performance of existing metrics over different gold-standard annotations, and show that automatic evaluation with our new annotation scheme has very strong correlation with expert rankings (ρ = 0.82). As a result, we advocate for a fundamental and necessary shift in the goal of GEC, from correcting small, labeled error types, to producing text that has native fluency.
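Since the abstract’s headline result is a rank correlation (Spearman’s ρ = 0.82) between automatic metric scores and expert system rankings, a minimal sketch of how such a comparison is computed may help. The per-system scores and ranks below are invented placeholders, not data from the paper, and the choice of scipy.stats.spearmanr is our own assumption about tooling.

```python
# A minimal sketch (not from the paper) of comparing a metric's
# system-level scores against expert human rankings via Spearman's rho.
from scipy.stats import spearmanr

# Hypothetical scores from an automatic metric for five GEC systems
# (higher = better), and the experts' ranking of the same systems
# (1 = best). Both lists are placeholders for illustration only.
metric_scores = [0.62, 0.55, 0.71, 0.48, 0.66]
human_ranks = [3, 4, 1, 5, 2]

# spearmanr rank-transforms both inputs, so raw scores can be compared
# directly against ranks. Because a lower rank number means a better
# system, we negate the ranks so that agreement with the metric's
# "higher is better" scores yields a positive correlation.
rho, p_value = spearmanr(metric_scores, [-r for r in human_ranks])
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```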
