Contextual semantic embeddings based on fine-tuned AraBERT model for Arabic text multi-class categorization

Abstract: Although pre-trained word embedding models have advanced a wide range of natural language processing applications, they ignore contextual information and the meaning of words within the text. In this paper, we investigate the potential of the pre-trained Arabic BERT (Bidirectional Encoder Representations from Transformers) model to learn universal contextualized sentence representations, and we showcase its usefulness for Arabic text multi-class categorization. We exploit the pre-trained AraBERT for contextual text representation learning in two different ways: as a transfer learning model and as a feature extractor. On the one hand, we fine-tune the AraBERT model's parameters on the OSAC datasets to transfer its knowledge to Arabic text categorization. On the other hand, we examine AraBERT's performance as a feature extractor by combining it with several classifiers, including CNN, LSTM, Bi-LSTM, MLP, and SVM. Finally, we conduct an exhaustive set of experiments comparing two BERT models, namely AraBERT and multilingual BERT. The findings show that the fine-tuned AraBERT model achieves state-of-the-art results, attaining up to 99% in terms of F1-score and accuracy.
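To make the two usage modes concrete, the sketch below illustrates them with the Hugging Face transformers library. This is our own minimal illustration, not the authors' released code: the aubmindlab/bert-base-arabert checkpoint name is an assumption about which public AraBERT weights are used, and the class count, sample text, and label are hypothetical placeholders.

```python
# Minimal sketch of the paper's two AraBERT usage modes (assumed setup,
# not the authors' implementation).
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

MODEL_NAME = "aubmindlab/bert-base-arabert"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

enc = tokenizer(["نص عربي للتصنيف"],  # hypothetical input document
                padding=True, truncation=True, max_length=128,
                return_tensors="pt")

# --- (1) Transfer learning: fine-tune all AraBERT weights end to end ---
num_classes = 10  # placeholder: set to the number of OSAC categories
clf = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=num_classes
)
labels = torch.tensor([3])          # hypothetical gold label
out = clf(**enc, labels=labels)
out.loss.backward()                 # gradients flow through the full encoder

# --- (2) Feature extractor: freeze AraBERT, reuse its [CLS] embeddings ---
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()
with torch.no_grad():
    feats = encoder(**enc).last_hidden_state[:, 0, :]  # 768-d [CLS] vector
# `feats` can then feed any downstream classifier (SVM, MLP, CNN,
# LSTM, Bi-LSTM), e.g.:
#   from sklearn.svm import SVC
#   SVC().fit(feats.numpy(), y)     # y: hypothetical label array
```

In mode (1) the classification head and the encoder are trained jointly on the target corpus, whereas in mode (2) the encoder stays frozen and only the downstream classifier is trained on the extracted sentence vectors.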
