Large Scale Implementations for Twitter Sentiment Classification

Sentiment analysis on Twitter data is a challenging problem because of the nature, diversity and volume of the data. People tend to express their feelings freely, which makes Twitter an ideal source for gathering a vast number of opinions on a wide spectrum of topics. This information offers huge potential and can be harnessed to estimate the sentiment tendency towards these topics. However, since no one can read through all of these tweets manually, an automated decision-making approach is necessary. Yet most existing solutions are limited to centralized environments and can therefore process at most a few thousand tweets. Given the massive number of tweets published daily, such a sample is not representative enough to determine the sentiment polarity towards a topic. In this work, we develop two systems for Big Data programming frameworks: the first in MapReduce and the second in Apache Spark. The algorithm uses all hashtags and emoticons inside a tweet as sentiment labels and classifies diverse sentiment types in a parallel and distributed manner. Moreover, the sentiment analysis tool is based on Machine Learning methodologies alongside Natural Language Processing techniques and utilizes Apache Spark’s Machine Learning library, MLlib. To address the nature of Big Data, we introduce pre-processing steps that improve Sentiment Analysis results, as well as Bloom filters that compact the storage size of intermediate data and boost the performance of our algorithm. Finally, the proposed system was trained and validated with real data crawled from Twitter, and an extensive experimental evaluation shows that our solution is efficient, robust and scalable while confirming the quality of our sentiment identification.
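The role of Bloom filters in compacting intermediate data can be illustrated with a minimal sketch. The code below is not the implementation used in this work; the class name `BloomFilter`, the bit-array size, the number of hash functions and the sample tweet features are all assumptions chosen for illustration. It stores set membership for a tweet's features in a fixed-size bit array, trading a small false-positive probability for a bounded memory footprint.

```python
import hashlib


class BloomFilter:
    """Illustrative fixed-size Bloom filter (hypothetical helper, not the paper's code)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Assumed example features extracted from a single tweet.
features = ["#happy", ":)", "great", "weekend"]
feature_filter = BloomFilter(num_bits=1024, num_hashes=3)
for feature in features:
    feature_filter.add(feature)
```

Whatever the number of features added, the filter occupies a constant 128 bytes here, which is the property that makes such a structure attractive for intermediate data shuffled between distributed tasks. A membership test such as `"#happy" in feature_filter` never yields a false negative, but it may occasionally report a feature that was never added.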
