Paper information - The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models

In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.
