NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task

We present the findings and results of the Second Nuanced Arabic Dialect Identification Shared Task (NADI 2021). This Shared Task includes four subtasks: country-level Modern Standard Arabic (MSA) identification (Subtask 1.1), country-level dialect identification (Subtask 1.2), province-level MSA identification (Subtask 2.1), and province-level sub-dialect identification (Subtask 2.2). The shared task dataset covers a total of 100 provinces from 21 Arab countries, collected from the Twitter domain. A total of 53 teams from 23 countries registered to participate in the tasks, reflecting the interest of the community in this area. We received 16 submissions for Subtask 1.1 from five teams, 27 submissions for Subtask 1.2 from eight teams, 12 submissions for Subtask 2.1 from four teams, and 13 submissions for Subtask 2.2 from four teams.
