Reference-based Metrics can be Replaced with Reference-less Metrics in Evaluating Grammatical Error Correction Systems

In grammatical error correction (GEC), automatically evaluating system outputs requires gold-standard references, which must be created manually and are therefore expensive and limited in coverage. To address this problem, reference-less approaches have recently emerged; however, previous reference-less metrics, which consider only the criterion of grammaticality, have not performed as well as reference-based metrics. This study explores the potential of extending a prior grammaticality-based method into a reference-less evaluation method for GEC systems. Further, we empirically show that a reference-less metric combining fluency and meaning preservation with grammaticality estimates manual scores better than commonly used reference-based metrics do. To our knowledge, this is the first study to provide empirical evidence that a reference-less metric can replace reference-based metrics in evaluating GEC systems.
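The core idea of combining three reference-less sub-scores into one estimate can be sketched as follows. This is a minimal illustration, not the paper's actual method: the function name, the weighted-average combination, and the equal default weights are all assumptions made for clarity; the paper does not specify this exact formula here.

```python
# Hypothetical sketch of a combined reference-less GEC metric.
# The abstract names three criteria: grammaticality, fluency, and
# meaning preservation. The weighted average below is illustrative
# only; the actual combination scheme is an assumption.

def combined_score(grammaticality, fluency, meaning,
                   weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of three sub-scores, each assumed to lie in [0, 1]."""
    for s in (grammaticality, fluency, meaning):
        if not 0.0 <= s <= 1.0:
            raise ValueError("sub-scores must lie in [0, 1]")
    wg, wf, wm = weights
    return wg * grammaticality + wf * fluency + wm * meaning

# Example: a correction that is grammatical and fairly fluent
# but drifts away from the source sentence's meaning.
score = combined_score(0.9, 0.8, 0.4)
```

The point of the combination is that a grammaticality-only metric would score the example above highly even though the correction changed the sentence's meaning; adding fluency and meaning-preservation terms penalizes such outputs.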
