A Pipeline Approach to Supervised Error Correction for the QALB-2014 Shared Task

This paper describes our submission to the ANLP-2014 shared task on automatic Arabic error correction. We present a pipeline approach that integrates an error detection model, a combination of character- and word-level translation models, a reranking model, and a punctuation insertion model. Our system achieves an F1 score of 62.8% on the development set of the QALB corpus and 58.6% on the official test set.
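
To make the idea of a correction pipeline concrete, the following minimal sketch shows one way to chain independent stages so that each consumes the output of the previous one. It is an illustration only, not the submitted system: the names `run_pipeline`, `toy_spelling_stage`, and `toy_punctuation_stage` are hypothetical, the stages are toy stand-ins for the trained models named in the abstract, and the English tokens are used purely for readability.

```python
from functools import reduce
from typing import Callable, List

# A stage maps a token sequence to a (possibly corrected) token sequence.
Stage = Callable[[List[str]], List[str]]


def run_pipeline(tokens: List[str], stages: List[Stage]) -> List[str]:
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda current, stage: stage(current), stages, tokens)


def toy_spelling_stage(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for a learned correction model: a fixed lookup."""
    fixes = {"teh": "the"}
    return [fixes.get(token, token) for token in tokens]


def toy_punctuation_stage(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for punctuation insertion: add a final period."""
    if tokens and tokens[-1] not in {".", "!", "?"}:
        return tokens + ["."]
    return tokens


corrected = run_pipeline(
    ["teh", "cat", "sat"],
    [toy_spelling_stage, toy_punctuation_stage],
)
print(corrected)
```

Keeping every stage behind the same token-list interface means stages can be reordered, removed, or evaluated in isolation, which is the main practical appeal of a pipeline design.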
