Towards Automatic Normalization of the Moroccan Dialectal Arabic User Generated Text

Today, social media is an important means of communication among people around the world. As in other countries, Moroccan people use several languages in their web communication, leaving behind a considerable amount of user-generated text. This text presents several opportunities for extracting useful information. However, processing this content is very challenging, especially when it is written in Moroccan Dialectal Arabic. This is due to several factors, such as script diversity (Arabic and Arabizi), orthographic errors, and the lack of writing rules. In this context, the present work is a first attempt at addressing the problem of Moroccan Dialectal Arabic spelling inconsistency in social media. We conduct an in-depth study that follows a systematic approach, reporting on a series of experiments performed on Moroccan Dialectal social media text. The most interesting finding to emerge is the orthographic inconsistency of written Moroccan Dialectal Arabic in both the Arabic and Latin scripts. This phenomenon affects a substantial amount of text in social media and demonstrates the need to exploit available Arabic tools in addition to building a customized spelling correction system.
