Paper information - From Arabic user-generated content to machine translation: integrating automatic error correction

With the spread of social media and online forums, individual users can actively participate in generating online content in different languages and dialects. Arabic is one of the fastest-growing languages on the Internet, but dialects (such as Egyptian and Saudi Arabian) account for a large share of Arabic online content. The many differences between Dialectal Arabic and Modern Standard Arabic pose challenges for machine translation of informal Arabic. In this paper, we investigate the use of an automatic error correction method to improve the quality of Arabic user-generated texts and their automatic translation. Our experiments show that the new system with an automatic correction module outperforms the baseline system, with a relative improvement of nearly 22.59%.
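As a reading aid for the figure above, the sketch below shows how a relative improvement is conventionally computed from two scores. It assumes a higher-is-better evaluation metric; the abstract does not name the metric or give the underlying scores, so the function name and the numbers used here are hypothetical and chosen only for illustration.

```python
def relative_improvement(baseline: float, system: float) -> float:
    """Return the percentage gain of `system` over `baseline`.

    Assumes a higher-is-better score; the metric itself is not
    specified in the abstract, so this is a generic formula.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (system - baseline) / baseline


# Hypothetical scores, not taken from the paper.
baseline_score = 10.0
corrected_score = 12.0
print(f"{relative_improvement(baseline_score, corrected_score):.2f}%")
```

A relative figure like this depends on the baseline: the same absolute gain produces a larger percentage when the baseline score is low, which is worth keeping in mind when comparing results across systems.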
