Improving Grammatical Error Correction with Machine Translation Pairs

We propose a novel data synthesis method for grammatical error correction that generates diverse error-corrected sentence pairs from two machine translation models of different quality (i.e., poor and good). The poor translation model resembles an ESL (English as a second language) learner and tends to generate translations of low quality in terms of fluency and grammatical correctness, while the good translation model generally generates fluent and grammatically correct translations. We build the poor model as a phrase-based statistical machine translation model with a decreased language model weight, and the good model as a neural machine translation model. By taking their translations of the same bridge-language sentences as error-corrected sentence pairs, we can construct a virtually unlimited amount of pseudo-parallel data. Our approach is capable of generating diverse fluency-improving patterns without being limited by a pre-defined rule set or by seed error-corrected data. Experimental results demonstrate the effectiveness of our approach and show that it can be combined with other synthetic data sources to yield further improvements.
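The pairing step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_pseudo_pairs`, the two translator callables, the French toy sentences, and the filter that drops empty or identical pairs are all assumptions made for illustration. In practice the callables would wrap a phrase-based SMT system with a lowered language model weight (poor) and an NMT system (good).

```python
from typing import Callable, Iterable, List, Tuple

# A translator maps one bridge-language sentence to one English sentence.
Translator = Callable[[str], str]


def build_pseudo_pairs(
    bridge_sentences: Iterable[str],
    poor_translate: Translator,
    good_translate: Translator,
) -> List[Tuple[str, str]]:
    """Pair the poor and good translations of each bridge-language sentence.

    The poor translation plays the role of the erroneous source and the
    good translation plays the role of the corrected target.
    """
    pairs = []
    for sentence in bridge_sentences:
        source = poor_translate(sentence).strip()
        target = good_translate(sentence).strip()
        # Assumption: skip empty or identical outputs, since they carry
        # no correction signal. The abstract does not specify a filter.
        if source and target and source != target:
            pairs.append((source, target))
    return pairs


# Hypothetical stand-ins for the two systems: lookup tables over toy data.
POOR_OUTPUT = {
    "Il va à l'école tous les jours.": "He go to school every days.",
    "Merci.": "Thank you.",
}
GOOD_OUTPUT = {
    "Il va à l'école tous les jours.": "He goes to school every day.",
    "Merci.": "Thank you.",
}

pairs = build_pseudo_pairs(POOR_OUTPUT.keys(), POOR_OUTPUT.get, GOOD_OUTPUT.get)
for source, target in pairs:
    print(f"{source}\t{target}")
```

Because the only input is monolingual text in the bridge language, the amount of synthetic data in this sketch is bounded by the available bridge-language corpus rather than by a rule set or by seed error-corrected data.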
