Text classification for automatic detection of alcohol use-related tweets: A feasibility study

We present a feasibility study of text classification for identifying tweets about alcohol use. Alcohol is the most widely used substance in the US, and alcohol use is the leading risk factor for premature morbidity and mortality globally. Understanding use patterns and locations is an important step toward prevention, moderation, and control of alcohol outlets. Social media may provide an alternative way to measure alcohol use in real time. We labeled 34,563 geo-located New York City tweets collected in a 24-hour period over New Year's Day 2012. We preprocessed the tweets into stemmed and unstemmed, unigram and bigram representations. We then applied multinomial naïve Bayes, a linear SVM, Bayesian logistic regression, and random forests to the classification task. Using 10-fold cross-validation, the algorithms achieved areas under the receiver operating characteristic curve (AUC) of 0.66, 0.91, 0.93, and 0.94, respectively. We also compared the classifiers with a human-constructed Boolean search over the same tweets; text classification was competitive with this hand-crafted search. In conclusion, we show that automatically identifying alcohol-related tweets is highly feasible, which paves the way for future research to improve these classifiers.
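The pipeline summarized above (n-gram features, a multinomial naïve Bayes classifier, AUC as the metric) can be illustrated with a minimal pure-Python sketch. It is not the study's implementation: the function names, the whitespace tokenizer, the Laplace smoothing constant, and the four toy tweets are all assumptions made for illustration, and stemming and 10-fold cross-validation are omitted for brevity.

```python
import math
from collections import Counter


def ngrams(text, use_bigrams=True):
    """Lowercase, split on whitespace, and return unigram (+ bigram) features."""
    tokens = text.lower().split()
    feats = list(tokens)
    if use_bigrams:
        feats += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return feats


def train_nb(docs, labels, alpha=1.0):
    """Fit a two-class multinomial naive Bayes model with Laplace smoothing."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(ngrams(doc))
    vocab = set(counts[0]) | set(counts[1])
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for y in (0, 1):
        total = sum(counts[y].values()) + alpha * len(vocab)
        model["prior"][y] = math.log(n_docs[y] / len(labels))
        model["loglik"][y] = {
            w: math.log((counts[y][w] + alpha) / total) for w in vocab
        }
    return model


def score(model, doc):
    """Log-odds that `doc` is alcohol-related; unseen features are ignored."""
    s = model["prior"][1] - model["prior"][0]
    for f in ngrams(doc):
        if f in model["vocab"]:
            s += model["loglik"][1][f] - model["loglik"][0][f]
    return s


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up tweets (1 = alcohol-related, 0 = not); not data from the study.
train_docs = [
    "cheers to champagne at midnight",
    "too many beers last night",
    "happy new year everyone",
    "watching the ball drop on tv",
]
train_labels = [1, 1, 0, 0]
model = train_nb(train_docs, train_labels)

test_docs = ["champagne cheers", "happy new year tv"]
test_labels = [1, 0]
test_scores = [score(model, d) for d in test_docs]
print("held-out AUC on toy data:", auc(test_scores, test_labels))
```

In a real evaluation the scores would come from held-out folds of a cross-validation loop, and the same `auc` function could then be applied to each classifier's scores to compare them.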
