Integrating Multiple Data Sources to Enhance Sentiment Prediction

Understanding the sentiment conveyed by a person is an important part of any social interaction, and sentiment in text can provide valuable insight into an author's opinion. Sentiment analysis of text is a large field of research within machine learning, as it allows the sentiment of large numbers of text instances to be determined and used to answer various questions, such as predicting election outcomes. Typically, a sentiment classifier is trained using data from the same domain to which it will be applied; however, there may not be sufficient training data within that domain. Additionally, using data from multiple sources, including other related domains, may help create a more generalized sentiment classifier that can be applied to multiple domains. To this end, we conduct an empirical study using sentiment data from two sources: online reviews and tweets. We first test the performance of sentiment analysis models built using a single data source for both in-domain and cross-domain classification. Then, we evaluate classifiers trained using instances randomly sampled from both sources. Additionally, we vary the number of instances sampled from the two data sources to determine how many instances should be included in a training data set. We apply statistical tests to verify the significance of our results and find that a classifier trained on a combination of reviews and tweets performs similarly to, or better than, any model trained on a single domain. We also find no significant difference in performance among classifiers trained with 100,000 or more combined training instances. These results are important because they indicate that a more robust classifier can be trained by using a smaller number of in-domain instances augmented with instances from a related domain, rather than using purely in-domain instances. Thus, we recommend using a training data set composed of both tweets and reviews when training a sentiment classifier intended to predict both tweet and review sentiment.
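
As an illustration of the mixed-source sampling described above, the following minimal sketch draws a fixed number of labeled instances at random from two pools and shuffles them into one training set. It is not the procedure used in the study: the function name `sample_mixed_training_set`, the `review_fraction` parameter, the 50/50 default split, and the `(text, label)` tuple layout are all assumptions made for this example.

```python
import random

def sample_mixed_training_set(reviews, tweets, total, review_fraction=0.5, seed=0):
    """Draw `total` labeled instances from two sources, without replacement.

    `reviews` and `tweets` are lists of (text, label) pairs. The split between
    the sources is controlled by `review_fraction`, a hypothetical parameter
    introduced only for this illustration.
    """
    if not 0.0 <= review_fraction <= 1.0:
        raise ValueError("review_fraction must be between 0 and 1")
    n_reviews = round(total * review_fraction)
    n_tweets = total - n_reviews
    if n_reviews > len(reviews) or n_tweets > len(tweets):
        raise ValueError("not enough instances in one of the sources")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    mixed = rng.sample(reviews, n_reviews) + rng.sample(tweets, n_tweets)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed
```

Repeating the call with different values of `total` gives training sets of different sizes, which could then be passed to any classifier to compare performance across sizes.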
