GWU-HASP-2015@QALB-2015 Shared Task: Priming Spelling Candidates with Probability

In this paper, we describe our system HASP-2015 (Hybrid Arabic Spelling and Punctuation Corrector), which introduces significant improvements over our previous version, HASP-2014, and with which we participated in the QALB-2015 Second Shared Task on Arabic Error Correction. Our system utilizes probabilistic information on errors and their possible corrections in the training data and combines it with an open-source reference dictionary (or word list) for detecting errors and for generating and filtering candidates. We enhance the system further by allowing it to generate candidates for common semantic and grammatical errors. Finally, an n-gram language model is used to select the best candidates. We use a CRF (Conditional Random Fields) classifier for correcting punctuation errors in a two-pass process: the system first learns punctuation placement and then learns to identify punctuation types.
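
The candidate priming and selection steps described above can be illustrated with a minimal sketch. It is not the HASP-2015 implementation: the function names, the toy English data, the greedy left-to-right search, and the add-one smoothed bigram model are all assumptions made for illustration. It estimates P(correction | error) from training pairs, filters candidates against a word list, and picks the candidate that maximizes the sum of the log prior and the language-model log probability.

```python
import math
from collections import Counter, defaultdict

def build_error_model(pairs):
    """Estimate P(correction | error) from (error, correction) training pairs."""
    counts = defaultdict(Counter)
    for error, correction in pairs:
        counts[error][correction] += 1
    return {
        error: {c: n / sum(corr.values()) for c, n in corr.items()}
        for error, corr in counts.items()
    }

def candidates(word, error_model, dictionary):
    """Return {candidate: prior}, keeping only candidates found in the word list.
    Falls back to the word itself when no candidate survives the filter."""
    cands = {c: p for c, p in error_model.get(word, {}).items() if c in dictionary}
    return cands or {word: 1.0}

def train_bigram(sentences):
    """Collect unigram and bigram counts, with <s> as the sentence-start symbol."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi

def bigram_logprob(prev, word, uni, bi):
    """Add-one smoothed log P(word | prev); the smoothing choice is an assumption."""
    return math.log((bi[(prev, word)] + 1) / (uni[prev] + len(uni)))

def correct(tokens, error_model, dictionary, uni, bi):
    """Greedy left-to-right selection of the best-scoring candidate per token."""
    out, prev = [], "<s>"
    for tok in tokens:
        cands = candidates(tok, error_model, dictionary)
        best = max(
            cands,
            key=lambda c: math.log(cands[c]) + bigram_logprob(prev, c, uni, bi),
        )
        out.append(best)
        prev = best
    return out

# Toy data (hypothetical, English for readability).
pairs = [("teh", "the"), ("teh", "the"), ("teh", "ten"), ("cta", "cat")]
dictionary = {"the", "ten", "cat", "sat"}
error_model = build_error_model(pairs)
uni, bi = train_bigram([["the", "cat", "sat"], ["the", "cat"]])
result = correct(["teh", "cta", "sat"], error_model, dictionary, uni, bi)
```

A real system would presumably search over whole-sentence hypotheses with a higher-order language model rather than choosing greedily, but the sketch shows how a training-derived prior and a dictionary filter can narrow the candidate set before the language model makes the final choice.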
