Automatic Identification of Moroccan Colloquial Arabic

Language identification is an NLP task that aims to predict the language of a given text. For Arabic dialects, many attempts have been made to address this task. In this paper, we present our approach to building a language identification system that distinguishes between the Moroccan Colloquial Arabic and Arabic languages using two different methods. The first is rule-based and relies on stop word frequency, while the second is statistically based and uses several machine learning classifiers. The results show that the statistical approach outperforms the rule-based approach. Furthermore, the Support Vector Machines classifier is more accurate than the other statistical classifiers. Our goal in this paper is to pave the way toward building advanced Moroccan dialect NLP tools, such as a morphological analyzer and a machine translation system.
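To make the rule-based idea concrete, the following is a minimal sketch of stop word frequency identification. It is not the system described in this paper: the two stop word lists are tiny illustrative assumptions, and the function name `identify`, the labels, and the tie-breaking rule are hypothetical choices made only for this example.

```python
from collections import Counter

# Tiny illustrative stop word lists (assumptions for this sketch,
# not the lexicons used in the paper).
MCA_STOP_WORDS = {"ديال", "بزاف", "هاد", "واش", "ماشي"}
ARABIC_STOP_WORDS = {"الذي", "هذا", "ليس", "لماذا", "جدا"}


def identify(text: str) -> str:
    """Label a text by comparing how many stop words of each variety it contains."""
    # Whitespace tokenization is a simplification; real text would need
    # normalization and punctuation handling.
    counts = Counter(text.split())
    mca_hits = sum(counts[word] for word in MCA_STOP_WORDS)
    arabic_hits = sum(counts[word] for word in ARABIC_STOP_WORDS)
    if mca_hits > arabic_hits:
        return "MCA"
    if arabic_hits > mca_hits:
        return "Arabic"
    # No stop word evidence, or a tie: the rule cannot decide.
    return "unknown"


if __name__ == "__main__":
    print(identify("هاد الكتاب ديال خويا"))
    print(identify("هذا الكتاب ليس جديدا"))
```

A rule of this kind depends entirely on the coverage of its word lists and cannot decide when a text contains no listed stop words, which is one plausible reason a trained statistical classifier could perform better.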
