From Arabic user-generated content to machine translation: integrating automatic error correction

With the spread of social media and online forums, individual users have been able to participate actively in generating online content in different languages and dialects. Arabic is one of the fastest-growing languages on the Internet, but dialects (such as Egyptian and Saudi Arabian) account for a large share of Arabic online content. The many differences between Dialectal Arabic and Modern Standard Arabic pose challenges for machine translation of informal Arabic. In this paper, we investigate the use of an automatic error correction method to improve the quality of Arabic user-generated text and its automatic translation. Our experiments show that the new system with an automatic correction module outperforms the baseline system, with a relative improvement of nearly 22.59%.
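For readers unfamiliar with the term, a relative improvement is the gain over the baseline divided by the baseline score. The sketch below shows only that arithmetic. It assumes a metric where higher is better, which the abstract does not specify; the function name and the input scores are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline: float, system: float) -> float:
    """Return the relative improvement of `system` over `baseline`, in percent.

    Assumes a higher-is-better score; for an error-style metric where
    lower is better, the sign of the result would need to be flipped.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (system - baseline) / baseline * 100.0


# Illustrative numbers only, not taken from the paper.
print(round(relative_improvement(20.0, 25.0), 2))  # 25.0
```

A relative figure like this says nothing about the absolute scores involved, so it is best read alongside the baseline and system scores reported in the full paper.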
