JABER: Junior Arabic BERt

Language-specific pre-trained models have proven to be more accurate than multilingual ones in monolingual evaluation settings, and Arabic is no exception. However, we found that previously released Arabic BERT models were significantly under-trained. In this technical report, we present JABER (Junior Arabic BERt), our pre-trained language model prototype dedicated to Arabic. We conduct an empirical study to systematically evaluate model performance across a diverse set of existing Arabic NLU tasks. Experimental results show that JABER achieves state-of-the-art performance on ALUE, a new benchmark for Arabic Language Understanding Evaluation, as well as on a well-established NER benchmark.
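To make the evaluation setting concrete, below is a minimal sketch of fine-tuning a pre-trained Arabic BERT encoder on one ALUE-style single-sentence classification task with the Hugging Face transformers library. The checkpoint name, file paths, and column names are illustrative assumptions, not the released JABER artifacts or official ALUE data loaders.

```python
# A minimal fine-tuning sketch, assuming a generic public Arabic BERT checkpoint
# and a CSV-formatted classification task with "text" and "label" columns.
# The checkpoint name and file paths are placeholders, not JABER releases.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # stand-in Arabic BERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical task files; any single-sentence Arabic NLU task would fit here.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="arabic_bert_finetune",
        num_train_epochs=3,
        per_device_train_batch_size=32,
        learning_rate=2e-5,
    ),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding in the default collator
)
trainer.train()
print(trainer.evaluate())
```

The same recipe applies to any encoder checkpoint; only the model name and the per-task data files change across the benchmark tasks.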
