ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic

Masked language models (MLMs) have become an integral part of many natural language processing systems. Although multilingual MLMs have been introduced to serve many languages, they are limited in their capacity and in the size and diversity of the non-English data they are pre-trained on. In this work, we remedy these issues for Arabic by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT, that outperform all existing models. To evaluate our models, we propose ArBench, a new benchmark for multi-dialectal Arabic language understanding. ArBench is built from 41 datasets targeting 5 different tasks/task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ArBench, ARBERT and MARBERT collectively achieve new SOTA, with sizeable margins over all existing models, including mBERT, XLM-R (Base and Large), and AraBERT, on 37 out of 45 classification tasks across the 41 datasets (82.22%). Our models are publicly available for research.
