Paper Information - ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic

Masked language models (MLMs) have become an integral part of many natural language processing systems. Although multilingual MLMs have been introduced to serve many languages, they are limited in their capacity and in the size and diversity of the non-English data they are pre-trained on. In this work, we remedy these issues for Arabic by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT, that have superior performance to all existing models. To evaluate our models, we propose ArBench, a new benchmark for multi-dialectal Arabic language understanding. ArBench is built using 41 datasets targeting 5 different tasks/task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ArBench, ARBERT and MARBERT collectively achieve new SOTA with sizeable margins compared to all existing models, such as mBERT, XLM-R (Base and Large), and AraBERT, on 37 out of 45 classification tasks on the 41 datasets (82.22%). Our models are publicly available for research.
