Sentiment Polarity Classification Using Statistical Data Compression Models

With the growing availability and popularity of user-generated content, the discipline of sentiment analysis has come to the attention of many researchers. Existing work has mainly focused on either knowledge-based methods or standard machine learning techniques. In this paper we investigate sentiment polarity classification based on adaptive statistical data compression models. We evaluate the classification performance of the lossless compression algorithm Prediction by Partial Matching (PPM) as well as compression-based measures using PPM-like character n-gram frequency statistics. Comprehensive experiments on three corpora show that compression-based methods are efficient, easy to apply, and can compete with the accuracy of sophisticated classifiers such as support vector machines.
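
The general idea behind compression-based classification can be illustrated with a minimal sketch. It is not the method evaluated in the paper: PPM is not part of the Python standard library, so the sketch substitutes zlib (DEFLATE) as a stand-in compressor. The names `compressed_size`, `classify`, and `TOY_TRAINING`, as well as the toy training texts, are assumptions made purely for illustration. A document is assigned to the class whose training text it compresses best with, measured as the increase in compressed size when the document is appended to that class's training text.

```python
import zlib

# Toy training data, invented for illustration only (not from the paper's corpora).
TOY_TRAINING = {
    "positive": (
        "a wonderful and moving film. the acting is superb and "
        "the story is touching. i loved every scene."
    ),
    "negative": (
        "a dull and boring film. the acting is wooden and "
        "the plot is a tedious mess. i hated every scene."
    ),
}


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: dict) -> tuple:
    """Return (best_label, costs) for the given document.

    The cost for a label is the number of extra compressed bytes needed
    when the document is appended to that label's training text.
    NOTE: zlib is a stand-in here; the paper uses PPM, not DEFLATE.
    """
    costs = {}
    for label, corpus in training.items():
        baseline = compressed_size(corpus)
        combined = compressed_size(corpus + " " + document)
        costs[label] = combined - baseline
    best_label = min(costs, key=costs.get)
    return best_label, costs


if __name__ == "__main__":
    for doc in (
        "the acting is superb and the story is touching",
        "the acting is wooden and the plot is a tedious mess",
    ):
        label, costs = classify(doc, TOY_TRAINING)
        print(f"{doc!r} -> {label} {costs}")
```

A compressor with a small sliding window such as zlib only sees the most recent part of a long training text, which is one reason an adaptive statistical model like PPM may be a better fit for real corpora than this stand-in.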
