Reference-based Metrics can be Replaced with Reference-less Metrics in Evaluating Grammatical Error Correction Systems

In grammatical error correction (GEC), automatically evaluating system outputs requires gold-standard references, which must be created manually and are therefore expensive and limited in coverage. To address this problem, reference-less approaches have recently emerged; however, previous reference-less metrics, which consider only the criterion of grammaticality, have not performed as well as reference-based metrics. This study explores the potential of extending a prior grammaticality-based method into a reference-less evaluation method for GEC systems. Further, we empirically show that a reference-less metric combining fluency and meaning preservation with grammaticality estimates manual scores better than commonly used reference-based metrics do. To our knowledge, this is the first study to provide empirical evidence that a reference-less metric can replace reference-based metrics in evaluating GEC systems.
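The core idea of combining three reference-less sub-scores into one estimate can be sketched as follows. This is a minimal illustration, not the paper's actual method: the function name, the weighted-average combination, and the equal default weights are all assumptions made for clarity; the paper does not specify this exact formula here.

```python
# Hypothetical sketch of a combined reference-less GEC metric.
# The abstract names three criteria: grammaticality, fluency, and
# meaning preservation. The weighted average below is illustrative
# only; the actual combination scheme is an assumption.

def combined_score(grammaticality, fluency, meaning,
                   weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of three sub-scores, each assumed to lie in [0, 1]."""
    for s in (grammaticality, fluency, meaning):
        if not 0.0 <= s <= 1.0:
            raise ValueError("sub-scores must lie in [0, 1]")
    wg, wf, wm = weights
    return wg * grammaticality + wf * fluency + wm * meaning

# Example: a correction that is grammatical and fairly fluent
# but drifts away from the source sentence's meaning.
score = combined_score(0.9, 0.8, 0.4)
```

The point of the combination is that a grammaticality-only metric would score the example above highly even though the correction changed the sentence's meaning; adding fluency and meaning-preservation terms penalizes such outputs.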
