Discovering Health Topics in Social Media Using Topic Models

By aggregating self-reported health statuses across millions of users, we seek to characterize the variety of health information discussed on Twitter, a social media website. We describe a topic modeling framework for discovering health topics in Twitter messages. The approach is exploratory: its goal is to understand which health topics are commonly discussed in social media. This paper describes in detail a statistical topic model created for this purpose, the Ailment Topic Aspect Model (ATAM), as well as our system for filtering general Twitter data using health keywords and supervised classification. We show how ATAM and other topic models can automatically infer health topics in 144 million Twitter messages from 2011 to 2013. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with temporal surveillance data for seasonal influenza (r = 0.689) and allergies (r = 0.810), as well as with geographic survey data on exercise (r = 0.534) and obesity (r = −0.631) in the United States. These results demonstrate that, in contrast to prior work, it is possible to automatically discover topics that attain statistically significant correlations with ground truth data while using minimal human supervision and no historical data to train the model. They also demonstrate that a single general-purpose model can identify many different health topics in social media.
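The r values reported above are correlation coefficients between a topic's prevalence in Twitter messages and an external reference series. As a minimal sketch of that kind of comparison, assuming the Pearson coefficient is the intended statistic, the Python below computes r for two aligned series. The function name `pearson_r` and the two toy series are hypothetical illustrations; they are not the paper's data or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, one entry per time period (not real data):
# the share of messages assigned to a topic, and a surveillance rate.
topic_share = [0.010, 0.014, 0.021, 0.030, 0.024, 0.015]
surveillance_rate = [1.1, 1.6, 2.4, 3.5, 2.9, 1.8]

r = pearson_r(topic_share, surveillance_rate)
print(f"r = {r:.3f}")
```

The same function applies to geographic comparisons if each entry is a region (for example, a U.S. state) rather than a time period; a negative r, as reported for obesity, indicates that the topic is less prevalent where the survey measure is higher.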
