The Effect of Dataset Size on Training Tweet Sentiment Classifiers

Using automated methods to label tweet sentiment, large volumes of tweets can be labeled and used to train classifiers. Millions of tweets could be used to train a classifier; however, doing so is computationally expensive. It is therefore valuable to establish how many tweets should be used to train a classifier, since using additional instances with no gain in performance wastes resources. In this study, we seek to determine how many tweets are needed before adding further instances yields no significant improvement in sentiment classification. We train and evaluate classifiers using C4.5 decision tree, Naïve Bayes, 5-Nearest Neighbor, and Radial Basis Function Network learners on seven datasets ranging from 1,000 to 243,000 instances. Models are trained using four runs of 5-fold cross-validation. Additionally, we conduct statistical tests to verify our observations and examine the impact of limiting features by frequency. All learners were found to improve with dataset size, with Naïve Bayes being the best-performing learner; Naïve Bayes did not significantly benefit from using more than 81,000 instances. To the best of our knowledge, this is the first study to investigate how learners scale with respect to dataset size, with results verified using statistical tests and multiple models trained for each learner and dataset size. Additionally, we investigated using feature frequency to greatly reduce data grid size, with either a small increase or decrease in classifier performance depending on the choice of learner.
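The evaluation protocol described above (four runs of 5-fold cross-validation at each dataset size, yielding 20 models per learner/size configuration) can be sketched as follows. This is not the authors' code: it is a minimal stdlib-only illustration that substitutes a trivial majority-class baseline for the actual learners (C4.5, Naïve Bayes, etc.), and the function names and the synthetic 60/40 label distribution are assumptions made for the example.

```python
import random
from collections import Counter

def five_fold_indices(n, seed):
    """Shuffle instance indices and split them into 5 roughly equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::5] for i in range(5)]

def cv_accuracy(labels, runs=4):
    """Four runs of 5-fold CV for a majority-class baseline.

    Each run uses a different shuffle, so 4 x 5 = 20 models are
    built and scored per configuration, mirroring the protocol
    described in the abstract (with a placeholder learner).
    """
    scores = []
    for run in range(runs):
        for test in five_fold_indices(len(labels), seed=run):
            test_set = set(test)
            train = [labels[i] for i in range(len(labels)) if i not in test_set]
            # "Train": predict the majority class seen in the training folds.
            pred = Counter(train).most_common(1)[0][0]
            correct = sum(1 for i in test if labels[i] == pred)
            scores.append(correct / len(test))
    return sum(scores) / len(scores)  # mean accuracy over the 20 models

# Repeat the same protocol at increasing dataset sizes to study scaling.
for size in (1000, 3000, 9000):
    labels = ["pos"] * (size * 6 // 10) + ["neg"] * (size * 4 // 10)
    print(size, round(cv_accuracy(labels), 3))
```

In the study itself, accuracy would be replaced by the reported performance metric and the baseline by each of the four learners; the key point illustrated is that every learner/size pair is scored as an average over repeated cross-validation rather than a single train/test split.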
