Bias-aware lexicon-based sentiment analysis

Sentiment analysis of textual content is widely used to automatically summarize the opinions and sentiments that people express. With the growing popularity of social media and user-generated content, efficient and effective sentiment analysis is critical to businesses and governments. Lexicon-based methods are efficient because they rely on manually developed affective word lists and valence values. However, the predictions of such methods can be biased towards positive or negative polarity, which distorts the analysis. In this paper, we propose Bias-Aware Thresholding (BAT), an approach that can be combined with any lexicon-based method to make it bias-aware. BAT is motivated by cost-sensitive learning, where the prediction threshold is shifted to reduce bias in prediction errors. We formally define bias in polarity predictions and present a measure for quantifying it. We evaluate BAT in combination with AFINN and SentiStrength -- two popular lexicon-based methods -- on seven real-world datasets. The results show that bias decreases smoothly as the absolute value of the threshold increases, and that accuracy also increases in most cases. We demonstrate that the threshold can be learned reliably from a very small number of labeled examples, and that supervised classifiers trained on such small datasets achieve worse bias and accuracy.
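The thresholding idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy lexicon, the function names, and in particular the bias measure (error rate on negative examples minus error rate on positive examples) are assumptions made for illustration, since the abstract does not state the formal definition. The sketch scores a text by summing word valences, predicts positive polarity when the score exceeds a threshold, and picks the threshold from a small labeled sample.

```python
from typing import Dict, Sequence

# Hypothetical toy lexicon; these valences are NOT taken from AFINN or SentiStrength.
TOY_LEXICON: Dict[str, int] = {
    "good": 2, "great": 3, "love": 3, "nice": 1,
    "bad": -2, "awful": -3, "hate": -3,
}


def lexicon_score(text: str, lexicon: Dict[str, int]) -> int:
    """Sum the valence of every whitespace-separated token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in text.lower().split())


def predict(score: float, threshold: float = 0.0) -> int:
    """Return +1 (positive) if the score exceeds the threshold, else -1 (negative)."""
    return 1 if score > threshold else -1


def polarity_bias(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Assumed bias measure: error rate on negatives minus error rate on positives.

    A value above 0 means negatives are misclassified more often, i.e. the
    predictions lean positive; a value below 0 means they lean negative.
    """
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg_err = sum(p != -1 for p in neg) / len(neg) if neg else 0.0
    pos_err = sum(p != 1 for p in pos) / len(pos) if pos else 0.0
    return neg_err - pos_err


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def learn_threshold(scores: Sequence[float], y_true: Sequence[int],
                    candidates: Sequence[float]) -> float:
    """Pick the candidate with the smallest absolute bias; break ties by accuracy."""
    def key(threshold: float):
        preds = [predict(s, threshold) for s in scores]
        return (abs(polarity_bias(y_true, preds)), -accuracy(y_true, preds))
    return min(candidates, key=key)


if __name__ == "__main__":
    # A small labeled sample of lexicon scores (made-up values).
    scores = [3, 1, 1, 2, -2, 4]
    labels = [1, -1, -1, 1, -1, 1]
    default_preds = [predict(s) for s in scores]
    print("bias at threshold 0:", polarity_bias(labels, default_preds))
    best = learn_threshold(scores, labels, candidates=range(-5, 6))
    print("learned threshold:", best)
```

In this made-up sample the default threshold of 0 labels two weakly positive-scoring negative texts as positive, so the predictions lean positive; raising the threshold removes that imbalance. The grid search over integer candidates is one simple way to choose the threshold and is likewise an assumption of this sketch.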
