JABER: Junior Arabic BERt

Language-specific pre-trained models have proven to be more accurate than multilingual ones in monolingual evaluation settings, and Arabic is no exception. However, we found that previously released Arabic BERT models were significantly under-trained. In this technical report, we present JABER (Junior Arabic BERt), our pre-trained language model prototype dedicated to Arabic. We conduct an empirical study to systematically evaluate model performance across a diverse set of existing Arabic NLU tasks. Experimental results show that JABER achieves state-of-the-art performance on ALUE, a new benchmark for Arabic Language Understanding Evaluation, as well as on a well-established NER benchmark.
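To make the evaluation setting concrete, below is a minimal sketch of fine-tuning a pre-trained Arabic BERT encoder on one ALUE-style single-sentence classification task with the Hugging Face transformers library. The checkpoint name, file paths, and column names are illustrative assumptions, not the released JABER artifacts or official ALUE data loaders.

```python
# A minimal fine-tuning sketch, assuming a generic public Arabic BERT checkpoint
# and a CSV-formatted classification task with "text" and "label" columns.
# The checkpoint name and file paths are placeholders, not JABER releases.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # stand-in Arabic BERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical task files; any single-sentence Arabic NLU task would fit here.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="arabic_bert_finetune",
        num_train_epochs=3,
        per_device_train_batch_size=32,
        learning_rate=2e-5,
    ),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding in the default collator
)
trainer.train()
print(trainer.evaluate())
```

The same recipe applies to any encoder checkpoint; only the model name and the per-task data files change across the benchmark tasks.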
