Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus

In this paper we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and report experiments on cross-dialect Arabic machine translation. PADIC covers dialects from both the Maghreb and the Middle East, each aligned with Modern Standard Arabic (MSA). Five dialects are covered: three from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). We built PADIC from scratch because of the lack of dialect resources: across the Arab world, Arabic dialects are used in daily conversation but are not written. To the best of our knowledge, PADIC is so far the largest corpus available to the community working on dialects, especially those of the Maghreb. It contains 6400 sentences for each of the 5 dialects and for MSA. We conducted machine translation experiments between all the language pairs. For translation into MSA, we interpolated the corresponding language model (LM) with an LM trained on a large Arabic corpus. We also studied the impact of language model smoothing techniques on the machine translation results because this corpus, even though it is the largest of its kind, is still very small compared with those used for translating natural languages.
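The language model interpolation mentioned above can be illustrated with a minimal sketch. The abstract does not state which interpolation scheme or smoothing method was used, so the code below assumes plain linear interpolation and additive smoothing, and uses unigram models for brevity. The function names, the weight `lam`, and the toy token lists are hypothetical placeholders, not data or settings from PADIC.

```python
from collections import Counter
import math

def train_unigram(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram LM over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(p_small, p_large, lam):
    """Linear interpolation: lam * p_small(w) + (1 - lam) * p_large(w)."""
    return {w: lam * p_small[w] + (1 - lam) * p_large[w] for w in p_small}

def perplexity(model, tokens):
    """Per-token perplexity of a token sequence under a unigram model."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))

# Toy placeholder data: a small in-domain sample and a larger background sample.
small_corpus = "a b a c".split()
large_corpus = "a b b c c c d d d d".split()
vocab = sorted(set(small_corpus) | set(large_corpus))

p_small = train_unigram(small_corpus, vocab)
p_large = train_unigram(large_corpus, vocab)
p_mix = interpolate(p_small, p_large, lam=0.5)  # lam is a hypothetical weight
```

In practice the weight would typically be tuned on held-out text, for example by choosing the `lam` that minimizes `perplexity(p_mix, dev_tokens)`.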
