AraSenCorpus: A Semi-Supervised Approach for Sentiment Annotation of a Large Arabic Text Corpus

While sentiment analysis research for languages such as English has moved on to advanced topics, other languages such as Arabic still face basic challenges, most notably the lack of large corpora. Manual annotation is time-consuming and becomes impractical when the corpus is very large. This paper presents a semi-supervised self-learning technique for extending an Arabic sentiment-annotated corpus with unlabeled data. We use a neural network to train a set of models on a manually labeled dataset containing 15,000 tweets, and then use these models to extend the corpus into a large Arabic sentiment corpus called “AraSenCorpus”. AraSenCorpus contains 4.5 million tweets and covers both Modern Standard Arabic and some Arabic dialects. A long short-term memory (LSTM) deep learning classifier is trained and tested on the final corpus. We evaluate the proposed framework on two external benchmark datasets to verify that it improves Arabic sentiment classification. The experimental results show that classifiers trained on our corpus outperform existing state-of-the-art systems.
