Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling

A sufficient amount of annotated data is usually required to fine-tune pre-trained language models for downstream tasks. Unfortunately, obtaining labeled data can be costly, especially for multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned only on Modern Standard Arabic (MSA) to predict named entity (NE) and part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as much as ~10% F_1 (NER) and 2% accuracy (POS tagging). We achieve even better performance in few-shot scenarios with limited amounts of labeled data. An ablation study shows that the observed performance boost results directly from augmenting the training data with DA examples via self-training. This opens up opportunities for developing DA models that exploit only MSA resources. Our approach can also be extended to other languages and tasks.
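
The abstract describes a standard self-training recipe: fine-tune a tagger on MSA gold data, pseudo-label unlabeled DA text with it, keep only confident predictions, and re-train on the union. The sketch below illustrates one plausible confidence-based selection step; it is not the authors' released code, and the function name, the 0.95 threshold, and the min-over-tokens confidence rule are illustrative assumptions.

```python
import numpy as np

def select_pseudo_labeled(sentences, probs, threshold=0.95):
    """Keep unlabeled DA sentences whose least-confident token
    prediction still clears `threshold` (an assumed rule; the
    paper may filter differently).

    sentences: list of token lists (unlabeled DA text)
    probs:     list of (num_tokens, num_labels) softmax arrays
               produced by the MSA-fine-tuned tagger
    """
    selected = []
    for sent, p in zip(sentences, probs):
        pred_labels = p.argmax(axis=-1)      # predicted tag per token
        confidence = p.max(axis=-1).min()    # the weakest token decides
        if confidence >= threshold:
            selected.append((sent, pred_labels.tolist()))
    return selected

# Toy demo with synthetic scores for two 3-token, 2-label sentences:
sents = [["w1", "w2", "w3"], ["u1", "u2", "u3"]]
probs = [np.array([[0.98, 0.02], [0.97, 0.03], [0.99, 0.01]]),
         np.array([[0.60, 0.40], [0.95, 0.05], [0.55, 0.45]])]
print(select_pseudo_labeled(sents, probs))  # only the first survives
```

In each self-training round, the selected (sentence, pseudo-label) pairs would be added to the MSA gold set and the tagger fine-tuned again; the ablation mentioned above attributes the reported gains to exactly this training-data augmentation.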
