Human Evaluation of Grammatical Error Correction Systems

This paper presents the results of the first large-scale human evaluation of automatic grammatical error correction (GEC) systems. The twelve participating systems of the CoNLL-2014 shared task, together with the unchanged input, have been reassessed in a WMT-inspired human evaluation procedure. Methods introduced for the evaluation campaigns of the Workshop on Statistical Machine Translation (WMT) have been adapted to GEC and extended where necessary. The resulting rankings are used to evaluate standard GEC metrics in terms of their correlation with human judgment.
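
As a rough illustration of the last step, correlating a metric with a human ranking might look like the sketch below. It rests on assumptions: the abstract does not say which correlation coefficient is used, so Spearman's rank correlation is chosen here for illustration, and the scores are hypothetical values for five imaginary systems, not results from the paper.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks (highest value gets rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores (assumption, NOT data from the paper):
# one human-judgment score and one automatic-metric score per system.
human_scores = [0.62, 0.55, 0.48, 0.40, 0.31]
metric_scores = [0.37, 0.39, 0.30, 0.25, 0.27]

rho = spearman(human_scores, metric_scores)
print(f"Spearman's rho = {rho:.2f}")
```

A value of rho close to 1 would indicate that the metric orders the systems much as the human judges do; a value near 0 would indicate little agreement.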
