A Reassessment of Reference-Based Grammatical Error Correction Metrics

Several metrics have been proposed for evaluating grammatical error correction (GEC) systems based on the grammaticality, fluency, and adequacy of their output sentences. Previous studies of how well these metrics correlate with human quality judgments were inconclusive, owing to the lack of appropriate significance tests and to discrepancies in the methods and datasets used. In this paper, we re-evaluate reference-based GEC metrics by measuring their system-level correlations with human judgments on a large dataset of human-rated GEC outputs, and by properly conducting statistical significance tests. Our results show no significant advantage of GLEU over MaxMatch (M2), contradicting previous studies that claim GLEU to be superior. For a finer-grained analysis, we additionally evaluate these metrics for their agreement with human judgments at the sentence level. Our sentence-level analysis indicates that, between GLEU and M2, neither metric is consistently more useful: which one to prefer depends on the scenario. We further analyze these metrics qualitatively and find that, apart from being less interpretable and non-deterministic, GLEU also produces counter-intuitive scores on commonly occurring test examples.
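The core methodology, measuring how strongly each metric's system-level scores correlate with human judgments and testing whether one metric's correlation is significantly higher than another's, can be sketched as follows. This is an illustrative toy example, not the paper's code: the scores are made-up numbers, and the significance test shown here is a simple paired bootstrap over systems rather than whichever test the paper itself uses.

```python
# Sketch: system-level correlation of two GEC metrics (e.g. GLEU vs. M2)
# with human judgments, plus a paired-bootstrap significance test.
# All scores below are hypothetical toy values for illustration only.
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bootstrap_corr_diff(metric_a, metric_b, human, trials=10_000, seed=0):
    """Paired bootstrap over systems: returns the fraction of resamples in
    which metric_a does NOT correlate better with humans than metric_b
    (a rough one-sided p-value for 'metric_a is superior')."""
    rng = random.Random(seed)
    idx = list(range(len(human)))
    not_better = 0
    for _ in range(trials):
        s = [rng.choice(idx) for _ in idx]  # resample systems with replacement
        xa = [metric_a[i] for i in s]
        xb = [metric_b[i] for i in s]
        h = [human[i] for i in s]
        try:
            if pearson(xa, h) - pearson(xb, h) <= 0:
                not_better += 1
        except ZeroDivisionError:
            not_better += 1  # degenerate resample (zero variance): count conservatively
    return not_better / trials

# Toy setting: six GEC systems, each with a hypothetical GLEU score,
# M2 score, and mean human judgment.
gleu  = [0.62, 0.58, 0.55, 0.51, 0.47, 0.40]
m2    = [0.45, 0.48, 0.41, 0.38, 0.33, 0.30]
human = [0.70, 0.66, 0.61, 0.60, 0.52, 0.45]
p = bootstrap_corr_diff(gleu, m2, human)
# A large p here means GLEU's higher raw correlation is not significant.
```

The point of the bootstrap step is the one made in the abstract: two metrics can have visibly different raw correlations with human scores while the difference is not statistically significant once the small number of systems is taken into account.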