Creation of annotated country-level dialectal Arabic resources: An unsupervised approach

The widespread use of spoken Arabic dialects on social networking sites has stimulated growing interest in Natural Language Processing (NLP) for dialectal Arabic (DA). Arabic dialects are linguistically diverse and differ from Modern Standard Arabic (MSA). Their complexity and variety mean that a single NLP system cannot adequately serve all of them. Compared with MSA, the datasets available for the various dialects are generally limited in size, genre and scope. In this article, we present a novel approach that automatically builds an annotated country-level dialectal Arabic corpus and lists of dialectal words covering 15 Arabic dialects. The algorithm uses an iterative procedure with two main components: automatic creation of lists of dialectal words and automatic creation of an annotated Arabic dialect identification corpus. To our knowledge, our study is the first to examine and analyse the poor performance of the MSA part-of-speech tagger on dialectal Arabic content and to exploit that weakness to extract dialectal words. The pointwise mutual information association measure and the geographical frequency of word occurrence online are used to classify dialectal words. The annotated dialectal Arabic corpus (Twt15DA), built using our algorithm, is collected from Twitter and consists of 311,785 tweets containing 3,858,459 words in total. We randomly selected a sample of 75 tweets per country, 1,125 tweets in total, and had native speakers perform a manual dialect identification task on it. The results show an average inter-annotator agreement score of 64%, which reflects satisfactory agreement considering the overlapping features of the 15 Arabic dialects.
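As a rough illustration of the pointwise mutual information measure mentioned above, the following sketch scores how strongly a word is associated with a country label, using PMI(w, c) = log2( p(w, c) / (p(w) p(c)) ) estimated from token counts. The abstract does not specify how the measure is computed in the article, so the function name `pmi_by_country`, the input format and the toy tokens are all hypothetical and serve only to show the general technique.

```python
import math
from collections import Counter


def pmi_by_country(tweets):
    """Return PMI scores for every observed (word, country) pair.

    `tweets` is an iterable of (country_label, list_of_tokens) pairs.
    This input format is an assumption made for the example.
    """
    pair_counts = Counter()     # occurrences of a word under a country label
    word_counts = Counter()     # occurrences of a word overall
    country_counts = Counter()  # number of tokens seen per country
    total = 0                   # total number of tokens

    for country, tokens in tweets:
        for tok in tokens:
            pair_counts[(tok, country)] += 1
            word_counts[tok] += 1
            country_counts[country] += 1
            total += 1

    scores = {}
    for (tok, country), n in pair_counts.items():
        p_joint = n / total
        p_word = word_counts[tok] / total
        p_country = country_counts[country] / total
        scores[(tok, country)] = math.log2(p_joint / (p_word * p_country))
    return scores


# Toy data with placeholder tokens (not real dialectal words).
toy = [
    ("EG", ["w_eg", "shared"]),
    ("SA", ["w_sa", "shared"]),
]
scores = pmi_by_country(toy)
print(scores[("w_eg", "EG")])    # word seen only under EG: positive PMI
print(scores[("shared", "EG")])  # word spread evenly across labels: PMI of 0
```

A word that appears only under one country label receives a positive score, while a word distributed evenly across labels scores zero, which is the intuition behind using such a measure to separate country-specific vocabulary from shared vocabulary.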
