A comparative evaluation of pre-processing techniques and their interactions for Twitter sentiment analysis

Abstract: Pre-processing is the first step in text classification, and choosing the right pre-processing techniques can improve classification effectiveness. We experimentally compare 16 commonly used pre-processing techniques on two Twitter datasets for sentiment analysis, employing four popular machine learning algorithms: Linear SVC, Bernoulli Naive Bayes, Logistic Regression, and Convolutional Neural Networks. We evaluate the techniques on the classification accuracy they yield and the number of features they produce. We find that techniques such as lemmatization, removing numbers, and replacing contractions improve accuracy, while others, such as removing punctuation, do not. Finally, to investigate interactions (desirable or otherwise) between the techniques when they are applied together in a pipeline, we conduct an ablation and combination study. Its results show the significance of techniques such as replacing numbers and replacing repetitions of punctuation.
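To make the pipeline and ablation idea concrete, the following is a minimal sketch in Python, not the authors' implementation. It covers four of the techniques named above using only the standard library. The contraction table, the placeholder tokens `multiExclamation` and `multiQuestion`, and all function names are illustrative assumptions. The feature count is approximated here by the number of distinct whitespace-separated tokens.

```python
import re

# Hypothetical, deliberately tiny contraction table (an assumption for illustration).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "don't": "do not", "it's": "it is"}
_CONTRACTION_RE = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)


def replace_contractions(text):
    """Expand contractions listed in CONTRACTIONS (case-insensitive match)."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def replace_punctuation_repetitions(text):
    """Replace runs of '!' or '?' with a placeholder token (token names are assumed)."""
    text = re.sub(r"!{2,}", " multiExclamation ", text)
    return re.sub(r"\?{2,}", " multiQuestion ", text)


def remove_numbers(text):
    """Delete digit sequences."""
    return re.sub(r"\d+", "", text)


def remove_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


# Order matters: contractions must be expanded before apostrophes are stripped.
PIPELINE = [replace_contractions, replace_punctuation_repetitions,
            remove_numbers, remove_punctuation]


def apply_pipeline(text, techniques):
    """Apply the techniques in order, then lowercase and normalize whitespace."""
    for technique in techniques:
        text = technique(text)
    return " ".join(text.lower().split())


def vocabulary_size(tweets, techniques):
    """Count distinct tokens after pre-processing, a rough proxy for feature count."""
    return len({tok for tweet in tweets for tok in apply_pipeline(tweet, techniques).split()})


def ablation_variants(techniques):
    """Yield (omitted technique name, pipeline without that technique)."""
    for i, omitted in enumerate(techniques):
        yield omitted.__name__, techniques[:i] + techniques[i + 1:]


if __name__ == "__main__":
    tweets = ["I can't wait!!!", "I cannot wait 2 days"]
    print("full pipeline:", vocabulary_size(tweets, PIPELINE), "distinct tokens")
    for name, variant in ablation_variants(PIPELINE):
        print("without", name + ":", vocabulary_size(tweets, variant), "distinct tokens")
```

In an actual ablation study, each variant would feed a classifier and the change in accuracy would be recorded; the vocabulary size printed here only stands in for the feature-count measure.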
