Multi-dialect Arabic BERT for Country-level Dialect Identification

Arabic dialect identification is a complex problem because of a number of inherent properties of the language itself. In this paper, we present the experiments conducted and the models developed by our competing team, Mawdoo3 AI, on the way to our winning solution to subtask 1 of the Nuanced Arabic Dialect Identification (NADI) shared task. The dialect identification subtask provides 21,000 country-level labeled tweets covering all 21 Arab countries. The competition organizers also provide an unlabeled corpus of 10M tweets from the same domain for optional use. Our winning solution is an ensemble of different training iterations of our pre-trained BERT model, which achieved a micro-averaged F1-score of 26.78% on the subtask. We publicly release the pre-trained language model component of our winning solution under the name Multi-dialect-Arabic-BERT for use by interested researchers.
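As a rough illustration of what ensembling several training iterations can look like, the sketch below averages per-class probabilities across runs and picks the highest-scoring country for each tweet. The abstract does not state how the ensemble was combined, so probability averaging is an assumption here, and the function name `ensemble_predict`, the label names, and the probability values are all hypothetical.

```python
from typing import List, Sequence


def ensemble_predict(
    prob_runs: Sequence[Sequence[Sequence[float]]],
    labels: Sequence[str],
) -> List[str]:
    """Combine several runs by averaging class probabilities (assumed scheme).

    prob_runs[r][i][c] is the probability that run r assigns to class c
    for tweet i. All runs must score the same tweets in the same order.
    """
    n_runs = len(prob_runs)
    n_items = len(prob_runs[0])
    n_classes = len(labels)

    predictions = []
    for i in range(n_items):
        # Element-wise mean of the class distribution over all runs.
        avg = [
            sum(run[i][c] for run in prob_runs) / n_runs
            for c in range(n_classes)
        ]
        # Pick the class with the highest averaged probability.
        best = max(range(n_classes), key=avg.__getitem__)
        predictions.append(labels[best])
    return predictions


# Hypothetical outputs of three training iterations on two tweets.
labels = ["Egypt", "Jordan"]
runs = [
    [[0.6, 0.4], [0.2, 0.8]],
    [[0.3, 0.7], [0.1, 0.9]],
    [[0.9, 0.1], [0.4, 0.6]],
]
predicted = ensemble_predict(runs, labels)
```

Averaging probabilities rather than taking a majority vote lets a confident run outweigh two uncertain ones, as with the first tweet above; whether that matches the submitted system is not stated in the abstract.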
