Machine Learning Classifiers for Twitter Surveillance of Vaping: Comparative Machine Learning Study

Background: Twitter is a valuable social media platform for studying the prevalence of information and sentiment on vaping, which may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize the sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers, which rely on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets.

Objective: This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments.

Methods: We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set, 4000 tweets were selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance.
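To make one of the traditional classifiers named in the Methods concrete, the following is a minimal sketch of multinomial naive Bayes with Laplace smoothing, written in plain Python. It is not the study's implementation: the whitespace tokenizer, the smoothing value `alpha=1.0`, the function names, and the four toy tweets are all assumptions introduced here for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization (an assumption, not the study's preprocessing).
    return text.lower().split()


def train_nb(texts, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with Laplace smoothing (alpha)."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for c in classes for w in word_counts[c]}
    model = {"classes": classes, "vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c in classes:
        # Class prior: fraction of training tweets carrying this label.
        model["log_prior"][c] = math.log(doc_counts[c] / len(labels))
        # Smoothed per-class word likelihoods over the shared vocabulary.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for w in tokenize(text):
            if w in model["vocab"]:  # out-of-vocabulary tokens are skipped
                score += model["log_lik"][c][w]
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy tweets, invented for this sketch (not from the study's data).
texts = [
    "buy vape juice discount code",
    "free shipping on vape mods",
    "vaping made me cough all night",
    "my friend quit smoking by vaping",
]
labels = ["commercial", "commercial", "noncommercial", "noncommercial"]

model = train_nb(texts, labels)
print(predict(model, "discount vape juice free shipping"))
print(predict(model, "vaping made me quit smoking"))
```

The same structure would apply to the relevance and sentiment tasks by swapping the label set; a real system would train on the annotated tweets rather than a handful of examples.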
Results: For relevance, LSTM-CNN performed the best, with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98). For distinguishing commercial from noncommercial tweets, all deep learning classifiers, including LSTM-CNN, performed better than the traditional classifiers, with an AUC of 0.99 (95% CI 0.98-0.99). For provape sentiment, BiLSTM performed the best, with an AUC of 0.83 (95% CI 0.78-0.89). Overall, LSTM-CNN performed the best across all 3 classification tasks.

Conclusions: We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.
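The Results report each classifier's AUC with a 95% CI. As a rough illustration of how such a figure can be obtained, the sketch below computes the AUC as the probability that a randomly chosen positive tweet is scored above a randomly chosen negative one, then estimates a percentile bootstrap interval. The abstract does not say how the CIs were computed, so the bootstrap, the function names, and the toy labels and scores are assumptions made for this example.

```python
import random


def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the AUC (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class, so AUC is undefined
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical labels (1 = provape) and classifier scores, invented for this sketch.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.7, 0.1]

point = auc(labels, scores)
lower, upper = bootstrap_ci(labels, scores)
print(f"AUC {point:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

With only eight examples the interval is very wide; on a held-out test set of realistic size the same procedure gives intervals comparable in width to those reported above.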
