Categorical Probability Proportion Difference (CPPD): A Feature Selection Method for Sentiment Classification

Sentiment analysis is to extract the opinion of the user from of the text documents. Sentiment classification using machine learning methods face problem of handling huge number of unique terms in a feature vector for the classification. Thus it is required to eliminate the irrelevant and noisy terms from the feature vector. Feature selection methods reduce the feature size by selecting prominent features for better classification. In this paper, a new feature selection method namely Probability Proportion Difference (PPD) is proposed which is based on the probability of belongingness of a term to a particular class. It is capable of removing irrelevant terms from the feature vector. Further, a Categorical Probability Proportion Difference (CPPD) feature selection method is proposed based on Probability Proportion Difference (PPD) and Categorical Proportion Difference (CPD). CPPD feature selection method is able to select the features which are relevant and capable of discriminating the class. The performance of the proposed feature selection methods is compared with the CPD method and Information Gain (IG) method which has been identified as one of the best feature selection method for sentiment classification. Experimentation of proposed feature selection methods was performed on two standard datasets viz. movie review dataset and product review (i.e. book) dataset. Experimental results show that proposed CPPD feature selection method outperforms other feature selection method for sentiment classification.

[1]  Jin Zhang,et al.  An empirical study of sentiment analysis for chinese documents , 2008, Expert Syst. Appl..

[2]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[3]  John Blitzer,et al.  Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification , 2007, ACL.

[4]  Robert J. Hilderman,et al.  Categorical Proportional Difference: A Feature Selection Method for Text Categorization , 2008, AusDM.

[5]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[6]  Bo Pang,et al.  Thumbs up? Sentiment Classification using Machine Learning Techniques , 2002, EMNLP.

[7]  Bo Pang,et al.  A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts , 2004, ACL.

[8]  Lin Dai,et al.  Improving Sentiment Classification Using Feature Highlighting and Feature Bagging , 2011, 2011 IEEE 11th International Conference on Data Mining Workshops.

[9]  Timothy O'Keefe Feature Selection and Weighting Methods in Sentiment Analysis , 2009 .

[10]  Deyu Li,et al.  A feature selection method based on improved fisher's discriminant ratio for text sentiment classification , 2011, Expert Syst. Appl..

[11]  Deyu Li,et al.  A Feature Selection Method Based on Fisher's Discriminant Ratio for Text Sentiment Classification , 2009, WISM.

[12]  Hsinchun Chen,et al.  Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums , 2008, TOIS.