Weighted combination of BERT and N-GRAM features for Nuanced Arabic Dialect Identification

Across the Arab world, Arabic dialects are spoken by more than 300 million people and are increasingly common in social media text. However, these dialects are considered low-resource languages, which limits the development of machine-learning-based systems for them. In this paper, we investigate the Arabic dialect identification task from two perspectives: country-level dialect identification across 21 Arab countries, and province-level dialect identification across 100 provinces. We introduce a unified pipeline of state-of-the-art models that can handle both subtasks. Our experiments, submitted to the NADI shared task under the team name BERT-NGRAMS, show promising results at both the country level (F1-score of 25.99%) and the province level (F1-score of 6.39%). These results ranked us 2nd in the country-level subtask and 1st in the province-level subtask.
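The abstract does not spell out how the BERT and n-gram features are weighted, so the following is only a minimal sketch of one plausible reading of the title: a weighted average of the per-class probabilities produced by two classifiers. The function names, the `bert_weight` parameter, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def combine_probabilities(
    p_bert: Sequence[float],
    p_ngram: Sequence[float],
    bert_weight: float = 0.5,
) -> List[float]:
    """Weighted average of two per-class probability vectors.

    Hypothetical helper: the paper's actual combination scheme may differ.
    """
    if len(p_bert) != len(p_ngram):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= bert_weight <= 1.0:
        raise ValueError("bert_weight must lie in [0, 1]")
    return [
        bert_weight * b + (1.0 - bert_weight) * n
        for b, n in zip(p_bert, p_ngram)
    ]


def predict_label(
    labels: Sequence[str],
    p_bert: Sequence[float],
    p_ngram: Sequence[float],
    bert_weight: float = 0.5,
) -> str:
    """Return the label whose combined probability is highest."""
    combined = combine_probabilities(p_bert, p_ngram, bert_weight)
    best = max(range(len(combined)), key=combined.__getitem__)
    return labels[best]


# Toy example with made-up probabilities for three country labels.
labels = ["Egypt", "Iraq", "Morocco"]
p_bert = [0.50, 0.30, 0.20]   # assumed output of a BERT-based classifier
p_ngram = [0.20, 0.70, 0.10]  # assumed output of an n-gram classifier

print(predict_label(labels, p_bert, p_ngram, bert_weight=0.8))
print(predict_label(labels, p_bert, p_ngram, bert_weight=0.3))
```

In a setup like this, the weight would typically be tuned on a development set, and the same function could serve both subtasks by changing only the label set (21 countries or 100 provinces).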
