Automatic Identification of Maghreb Dialects Using a Dictionary-Based Approach

Automatic identification of Arabic dialects in text is a difficult task, especially for Maghreb dialects, which may be written in either Arabic or Latin characters (Arabizi). Such texts are characterized by code-switching: between Modern Standard Arabic (MSA) and the Arabic Dialect (AD) in texts written in Arabic, or between Arabizi and foreign languages in texts written in Latin characters. This paper presents the specific resources and tools we have developed for this purpose, with a focus on the transliteration of Arabizi into Arabic (using the dedicated tools for Arabic dialects). We also describe a dictionary-based approach to detecting the dialectal origin of a text, which yields satisfactory results.
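To make the idea of dictionary-based detection concrete, here is a minimal sketch in Python. It is not the system described in this paper: the lexicons, the function name `identify_dialect`, and the simple hit-counting score are all assumptions chosen for illustration, and the toy word lists are hypothetical stand-ins for real dialect dictionaries.

```python
from collections import Counter

# Hypothetical toy lexicons of Arabizi word forms, for illustration only.
# A real system would rely on much larger, curated dialect dictionaries.
LEXICONS = {
    "algerian": {"wesh", "bezzaf", "kifach"},
    "tunisian": {"barcha", "chnowa", "kifech"},
    "moroccan": {"bzaf", "wakha", "kifash"},
}


def identify_dialect(text, lexicons=LEXICONS):
    """Return (best_dialect, scores) by counting dictionary hits per dialect.

    Assumption: whitespace tokenization and exact lowercase matching.
    Returns (None, {}) when no token is found in any lexicon.
    """
    scores = Counter()
    for token in text.lower().split():
        for dialect, words in lexicons.items():
            if token in words:
                scores[dialect] += 1
    if not scores:
        return None, {}
    # On a tie, max() keeps the dialect that scored first; a real system
    # would need an explicit tie-breaking or abstention policy.
    best = max(scores, key=scores.get)
    return best, dict(scores)


if __name__ == "__main__":
    print(identify_dialect("wesh rak bezzaf"))
```

In this sketch, a text is assigned to the dialect whose dictionary matches the most tokens. Handling spelling variation in Arabizi, code-switched foreign words, and words shared across dialects would require additional steps, such as the transliteration into Arabic mentioned above.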
