Extracting a Chinese learner corpus from the web: Grammatical error correction for learning Chinese as a foreign language with statistical machine translation

In this paper, we describe the TMU system for the shared task on Grammatical Error Diagnosis for Learning Chinese as a Foreign Language (CFL) at NLP-TEA1. A main obstacle to grammatical error correction for CFL is the data bottleneck: the Chinese learner corpus at hand (the NTNU learner corpus) contains only 1,208 sentences in total, which is insufficient for training supervised methods. To overcome this problem, we extracted a large-scale Chinese learner corpus from the language exchange site Lang-8, yielding 95,706 sentences (two million words) after cleaning. We used it as a parallel corpus for a phrase-based statistical machine translation (SMT) system, which translates learner sentences into correct sentences.
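As a rough illustration of how learner/correction pairs might be turned into a line-aligned parallel corpus for SMT training, consider the sketch below. It is not the authors' pipeline: the cleaning rules (dropping empty pairs, duplicates, and pairs with an extreme length ratio), the character-level tokenization, and all function names and file paths are assumptions made for this example only.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (learner sentence, corrected sentence)


def clean_pairs(pairs: Iterable[Pair], max_len_ratio: float = 2.0) -> List[Pair]:
    """Filter sentence pairs with simple, assumed heuristics.

    Drops pairs with an empty side, exact duplicate pairs, and pairs whose
    character lengths differ by more than max_len_ratio.
    """
    seen = set()
    kept: List[Pair] = []
    for learner, corrected in pairs:
        learner, corrected = learner.strip(), corrected.strip()
        if not learner or not corrected:
            continue
        longer = max(len(learner), len(corrected))
        shorter = min(len(learner), len(corrected))
        if longer / shorter > max_len_ratio:
            continue  # likely a comment or rewrite rather than a correction
        if (learner, corrected) in seen:
            continue
        seen.add((learner, corrected))
        kept.append((learner, corrected))
    return kept


def to_char_tokens(sentence: str) -> str:
    """Split a sentence into space-separated characters (assumed tokenization)."""
    return " ".join(ch for ch in sentence if not ch.isspace())


def write_parallel(pairs: Iterable[Pair], src_path: str, tgt_path: str) -> None:
    """Write line-aligned source (learner) and target (corrected) files."""
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for learner, corrected in pairs:
            src.write(to_char_tokens(learner) + "\n")
            tgt.write(to_char_tokens(corrected) + "\n")
```

The two output files have the same number of lines, with line i of the source file aligned to line i of the target file, which is the input layout that phrase-based SMT toolkits generally expect for training.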
