Investigation of the Feature Selection Problem for Sentiment Analysis in Arabic Language

Sentiment analysis, also known as opinion mining, is the automatic detection of an author's attitude towards a certain subject in textual content. In this study we design and implement a document-level supervised sentiment analysis system for Arabic text and investigate its performance. We use three feature extraction methods to generate three datasets (unigrams, bigrams and trigrams) from the Opinion Corpus for Arabic (OCA). To find the optimal number of features and obtain the best time performance in sentiment analysis, we employ two feature ranking methods (Information Gain based and Chi-Square based) and calculate the score of each feature with respect to the class labels. This feature ranking step keeps only the features that are relevant to the class labels and removes the irrelevant features that cause unnecessary processing; hence, it helps to increase classification performance and reduce processing time. Finally, we evaluate the polarity classification performance of three standard classifiers, namely Support Vector Machines (SVM), K-Nearest Neighbor and Decision Tree, which are known for their effectiveness on these types of datasets, on the previously generated unigram- and bigram-based datasets. In our study the SVM classifier showed superior classification performance compared to the other two classifiers. Our experimental results also demonstrate the effectiveness of the two feature selection methods in reducing the feature space of the generated datasets and providing higher classification performance.
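The feature ranking step described above can be illustrated with a minimal sketch. It is not the system used in the study: the four-document toy corpus, the English placeholder tokens, and the function names (ngrams, information_gain, chi_square, rank_features) are all assumptions made for illustration, and the scores are computed on binary feature presence per document using the standard definitions of Information Gain and the 2x2 Chi-Square statistic.

```python
import math


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, feature):
    """Class entropy minus class entropy conditioned on feature presence."""
    classes = sorted(set(labels))
    present = [y for doc, y in zip(docs, labels) if feature in doc]
    absent = [y for doc, y in zip(docs, labels) if feature not in doc]
    prior = entropy([labels.count(c) for c in classes])
    conditional = sum(
        len(part) / len(labels) * entropy([part.count(c) for c in classes])
        for part in (present, absent)
    )
    return prior - conditional


def chi_square(docs, labels, feature, positive="pos"):
    """Chi-Square statistic of the 2x2 table (feature presence x class)."""
    a = sum(1 for doc, y in zip(docs, labels) if feature in doc and y == positive)
    b = sum(1 for doc, y in zip(docs, labels) if feature in doc and y != positive)
    c = sum(1 for doc, y in zip(docs, labels) if feature not in doc and y == positive)
    d = sum(1 for doc, y in zip(docs, labels) if feature not in doc and y != positive)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def rank_features(docs, labels, score, k):
    """Return the k highest-scoring features (ties broken alphabetically)."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda f: -score(docs, labels, f))[:k]


# Hypothetical toy corpus; a real run would tokenize OCA reviews instead.
raw = [
    ("great film great cast", "pos"),
    ("great story", "pos"),
    ("boring film", "neg"),
    ("boring plot weak cast", "neg"),
]
docs = [set(ngrams(text.split(), 1)) for text, _ in raw]  # unigram features
labels = [label for _, label in raw]

top_ig = rank_features(docs, labels, information_gain, k=2)
top_chi = rank_features(docs, labels, chi_square, k=2)
print(top_ig, top_chi)
```

Changing the second argument of ngrams to 2 or 3 would produce bigram or trigram feature sets in the same way. In this toy corpus both rankings keep the two words that separate the classes perfectly and discard words such as "film" that occur equally often in both classes, which is the kind of irrelevant feature the ranking step is meant to remove before classification.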
