Neural Grammatical Error Correction for Romanian

Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, and the spellcheckers available for these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian, consisting of 10k sentence pairs. In addition, we adapted the German version of the ERRANT (ERRor ANnotation Toolkit) scorer to Romanian in order to analyze this corpus and extract the edits needed for evaluation. We experimented with multiple neural models and pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline is a small Transformer model trained only on the GEC dataset ($F_{0.5}=44.38$), whereas the best-performing model is obtained by pretraining a larger Transformer model on artificially generated data and then finetuning it on the actual corpus ($F_{0.5}=53.76$). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger.
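
For readers unfamiliar with the $F_{0.5}$ metric reported above, the following is a minimal sketch of how an edit-level F-beta score is usually computed from counts of matched edits. The function name `f_beta` and the example counts are hypothetical and are not taken from the paper; only the standard F-beta formula is assumed. Note that the paper reports scores on a 0-100 scale, while this sketch returns a value between 0 and 1.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Edit-level F-beta score from hypothetical counts.

    tp: system edits that match a reference edit
    fp: system edits that match no reference edit
    fn: reference edits the system failed to produce
    beta < 1 weights precision more heavily than recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 6 correct edits, 2 spurious edits, 6 missed edits.
score = f_beta(tp=6, fp=2, fn=6)
print(f"F0.5 = {100 * score:.2f}")
```

With $\beta = 0.5$, precision counts twice as much as recall, which reflects the common view in GEC that proposing a wrong correction is worse than missing one.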
