Factored Statistical Machine Translation for Grammatical Error Correction

This paper describes our ongoing work on grammatical error correction (GEC). Targeting all error types that occur in a real-life environment, we propose a factored statistical machine translation (SMT) model for this task. We treat error correction as a series of language translation problems guided by various kinds of linguistic information, which act as factors that influence the translation results. The factors included in our study are morphological information (word stem, prefix, and suffix) and Part-of-Speech (PoS) information. In addition, we experimented with different combinations of phrase-based and factor-based translation models (TMs), trained on various datasets, to boost overall performance. Empirical results show that the proposed model yields an improvement of 32.54% over a baseline phrase-based SMT model. The system participated in the CoNLL 2014 shared task and achieved the 7th and 5th F0.5 scores on the official test set among the thirteen participating teams.
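To make the notion of factors concrete, the following is a minimal sketch of how a tokenized sentence might be annotated with the factors named above (surface form, stem, prefix, suffix, PoS), using the pipe-separated token layout that factored SMT toolkits such as Moses read. The helper names, the crude suffix-stripping stemmer, and the three-character affix length are all assumptions made for illustration; they are not taken from the described system, and the PoS tags are assumed to come from an external tagger.

```python
from typing import Sequence


def naive_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    lower = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def factor_token(word: str, pos: str, affix_len: int = 3) -> str:
    """Join surface form, stem, prefix, suffix and PoS tag with '|'.

    affix_len is an assumed setting: prefix and suffix are simply the
    first and last affix_len characters of the lowercased word.
    """
    lower = word.lower()
    prefix = lower[:affix_len]
    suffix = lower[-affix_len:]
    return "|".join([word, naive_stem(word), prefix, suffix, pos])


def factor_sentence(words: Sequence[str], tags: Sequence[str]) -> str:
    """Annotate a tokenized sentence; tags must align one-to-one with words."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return " ".join(factor_token(w, t) for w, t in zip(words, tags))


if __name__ == "__main__":
    print(factor_sentence(["He", "walked", "home"], ["PRP", "VBD", "NN"]))
```

In such a setup, both the learner (source) side and the corrected (target) side of the parallel corpus would be annotated in this way, so that translation steps can condition on stems or PoS tags in addition to surface forms. A token that itself contains `|` would need escaping first, which this sketch does not handle.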
