Feature Subsumption for Sentiment Classification in Multiple Languages

An open problem in machine learning-based sentiment classification is how to extract complex features that outperform simple features; determining which types of features are most valuable is another. Most studies focus primarily on character or word n-gram features, and substring-group features have not previously been considered for sentiment classification. In this study, substring-group features are extracted and selected for sentiment classification by means of a transductive learning-based algorithm. To demonstrate generality, experiments were conducted on three open datasets in three different languages: Chinese, English, and Spanish. The experimental results show that the proposed algorithm's performance is usually superior to the best performance reported in related work, and that the proposed feature subsumption algorithm for sentiment classification works across languages. The results also show that, compared to the inductive learning-based algorithm, the transductive learning-based algorithm can significantly improve sentiment classification performance. As for term weighting, the experiments show that “tfidf-c” outperforms all other term weighting approaches in the proposed algorithm.
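To make the notion of a substring group concrete, the following is a minimal sketch and not the algorithm used in the study. It assumes that a substring group is a set of substrings occurring at exactly the same start positions in the corpus, which is the equivalence induced by a node of a suffix tree. The function name `substring_groups` and the `max_len` cap are hypothetical, and the enumeration is deliberately naive (quadratic) rather than a suffix-tree construction.

```python
from collections import defaultdict


def substring_groups(docs, max_len=4):
    """Group substrings that share the same set of (doc index, start offset) occurrences.

    Assumption: this equivalence stands in for "substring group"; a real
    implementation would read the groups off a suffix tree instead.
    """
    occurrences = defaultdict(set)
    for d, text in enumerate(docs):
        for i in range(len(text)):
            # Enumerate substrings of length 1..max_len starting at offset i.
            for j in range(i + 1, min(len(text), i + max_len) + 1):
                occurrences[text[i:j]].add((d, i))

    # Substrings with identical occurrence sets fall into one group.
    groups = defaultdict(list)
    for sub, occ in occurrences.items():
        groups[frozenset(occ)].append(sub)
    return [sorted(g, key=len) for g in groups.values()]


if __name__ == "__main__":
    for group in substring_groups(["banana"], max_len=3):
        print(group)
```

Under this assumption, every member of a group has the same term frequency and document frequency, so a single feature per group can stand in for all of its substrings, which is what keeps the feature space far smaller than the set of all n-grams.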
