A Multidialectal Parallel Corpus of Arabic

The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic-speaking communities. These dialects vary widely from region to region and, to a lesser extent, from city to city within each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and, more recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not occur naturally, which makes this corpus a valuable resource with many potential applications, such as Arabic dialect identification and machine translation.
