Automatic Identification of Arabic Language Varieties and Dialects in Social Media

Modern Standard Arabic (MSA) is the formal language of most Arab countries. Arabic dialects (AD), the varieties used in daily life, differ from MSA, especially in social media communication. Moreover, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using character n-gram Markov language models and Naive Bayes classifiers, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results show that a Naive Bayes classifier based on a character bi-gram model can identify 18 different Arabic dialects with an overall accuracy of 98%.
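To make the idea of a Naive Bayes classifier over character bi-grams concrete, the following is a minimal sketch, not the paper's implementation. The class name `BigramNaiveBayes`, the helper `char_bigrams`, the add-one smoothing, the space padding, and the three toy phrases with their labels (`EGY`, `GLF`, `MSA`) are all assumptions made for illustration; they are not taken from the paper's datasets or experimental setup.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bi-grams, padding with spaces at both ends."""
    padded = f" {text.strip()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character bi-grams (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # additive smoothing constant (assumed)
        self.counts = defaultdict(Counter)  # label -> bi-gram counts
        self.doc_counts = Counter()         # label -> number of training texts
        self.vocab = set()                  # all bi-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_bigrams(text)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Log prior plus summed smoothed log likelihoods for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_bigrams(text)
        scores = {}
        for label, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                # Unseen bi-grams get a small smoothed probability instead of zero.
                numerator = gram_counts[gram] + self.alpha
                score += math.log(numerator / (total + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
texts = ["ازيك عامل ايه", "شلونك شخبارك", "كيف حالك اليوم"]
labels = ["EGY", "GLF", "MSA"]
model = BigramNaiveBayes().fit(texts, labels)
print(model.predict("ازيك عامل ايه"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen bi-gram from ruling out a dialect entirely. A real system would train on far larger corpora per dialect than this three-sentence toy set.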
