Contextual semantic embeddings based on fine-tuned AraBERT model for Arabic text multi-class categorization

Abstract: Although pre-trained word embedding models have advanced a wide range of natural language processing applications, they ignore contextual information and the meaning of words within the text. In this paper, we investigate the potential of the pre-trained Arabic BERT (Bidirectional Encoder Representations from Transformers) model to learn universal contextualized sentence representations, and we showcase its usefulness for Arabic text multi-class categorization. We exploit the pre-trained AraBERT for contextual text representation learning in two different ways: as a transfer learning model and as a feature extractor. On the one hand, we fine-tune the AraBERT model's parameters on the OSAC datasets to transfer its knowledge to Arabic text categorization. On the other hand, we examine AraBERT's performance as a feature extractor by combining it with several classifiers, including CNN, LSTM, Bi-LSTM, MLP, and SVM. Finally, we conduct an exhaustive set of experiments comparing two BERT models, namely AraBERT and multilingual BERT. The findings show that the fine-tuned AraBERT model achieves state-of-the-art results, attaining up to 99% in terms of F1-score and accuracy.
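To make the two usage modes concrete, the sketch below illustrates them with the Hugging Face transformers library. This is our own minimal illustration, not the authors' released code: the aubmindlab/bert-base-arabert checkpoint name is an assumption about which public AraBERT weights are used, and the class count, sample text, and label are hypothetical placeholders.

```python
# Minimal sketch of the paper's two AraBERT usage modes (assumed setup,
# not the authors' implementation).
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

MODEL_NAME = "aubmindlab/bert-base-arabert"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

enc = tokenizer(["نص عربي للتصنيف"],  # hypothetical input document
                padding=True, truncation=True, max_length=128,
                return_tensors="pt")

# --- (1) Transfer learning: fine-tune all AraBERT weights end to end ---
num_classes = 10  # placeholder: set to the number of OSAC categories
clf = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=num_classes
)
labels = torch.tensor([3])          # hypothetical gold label
out = clf(**enc, labels=labels)
out.loss.backward()                 # gradients flow through the full encoder

# --- (2) Feature extractor: freeze AraBERT, reuse its [CLS] embeddings ---
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()
with torch.no_grad():
    feats = encoder(**enc).last_hidden_state[:, 0, :]  # 768-d [CLS] vector
# `feats` can then feed any downstream classifier (SVM, MLP, CNN,
# LSTM, Bi-LSTM), e.g.:
#   from sklearn.svm import SVC
#   SVC().fit(feats.numpy(), y)     # y: hypothetical label array
```

In mode (1) the classification head and the encoder are trained jointly on the target corpus, whereas in mode (2) the encoder stays frozen and only the downstream classifier is trained on the extracted sentence vectors.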
