Optimal Feature Selection for Sentiment Analysis

Research on sentiment analysis (SA) has grown rapidly in recent years. Sentiment analysis concerns methods that automatically process text and extract the opinions of its authors. In this paper, unigrams and bi-grams are extracted from the text, and composite features are created from them. Part-of-speech (POS) based features, namely adjectives and adverbs, are also extracted. The Information Gain (IG) and Minimum Redundancy Maximum Relevancy (mRMR) feature selection methods are used to select prominent features. The effect of the various feature sets on sentiment classification is then investigated using machine learning methods on four standard datasets: a movie review dataset and three product review datasets (book, DVD, and electronics). Experimental results show that composite features created from the prominent unigrams and bi-grams perform better than the other features for sentiment classification. mRMR is a better feature selection method than IG for sentiment classification. The Boolean Multinomial Naive Bayes (BMNB) algorithm performs better than the Support Vector Machine (SVM) classifier for sentiment analysis in terms of both accuracy and execution time.
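
The pipeline summarized in the abstract can be illustrated with a minimal sketch, given here under stated assumptions and not as the authors' implementation. It extracts Boolean unigram and bi-gram features, ranks them by information gain, and trains a Boolean multinomial Naive Bayes classifier on the selected features. The toy reviews, the function names (`ngram_features`, `information_gain`, `select_top_k`, `train_bmnb`, `predict_bmnb`), whitespace tokenization, and add-one smoothing are all assumptions made for illustration. mRMR, POS filtering, and the SVM baseline are not shown.

```python
import math
from collections import Counter


def ngram_features(text):
    """Return the set of unigrams and bi-grams present in a text (Boolean presence)."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, feature):
    """IG(feature) = H(class) - H(class | feature present or absent)."""
    n = len(docs)
    base = entropy(list(Counter(labels).values()))
    conditional = 0.0
    for present in (True, False):
        subset = [y for d, y in zip(docs, labels) if (feature in d) == present]
        if subset:
            conditional += len(subset) / n * entropy(list(Counter(subset).values()))
    return base - conditional


def select_top_k(docs, labels, k):
    """Keep the k features with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda f: information_gain(docs, labels, f), reverse=True)
    return ranked[:k]


def train_bmnb(docs, labels, vocab):
    """Multinomial Naive Bayes where each feature counts at most once per document."""
    vocab = set(vocab)
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d & vocab)  # Boolean presence, restricted to selected features
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)  # add-one (Laplace) smoothing
        loglik[c] = {f: math.log((counts[c][f] + 1) / total) for f in vocab}
    return prior, loglik


def predict_bmnb(model, doc):
    """Return the class with the highest posterior log-score for a feature set."""
    prior, loglik = model
    return max(prior, key=lambda c: prior[c] + sum(loglik[c][f] for f in doc if f in loglik[c]))


# Hypothetical toy corpus, not taken from the datasets used in the paper.
reviews = [
    "a great movie",
    "great acting and great story",
    "a boring movie",
    "boring plot and weak story",
]
labels = ["pos", "pos", "neg", "neg"]

docs = [ngram_features(r) for r in reviews]
selected = select_top_k(docs, labels, k=2)
model = train_bmnb(docs, labels, selected)
print(selected)
print(predict_bmnb(model, ngram_features("a great story")))
```

Swapping `select_top_k` for a redundancy-aware ranking is where an mRMR-style selector would plug in. The classifier only sees the selected feature names, so it would not need to change.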
