Validating models for disease detection using twitter

Data mining social media has become a valuable resource for infectious disease surveillance. However, there are considerable risks associated with incorrectly predicting an epidemic. The large amount of social media data combined with the small amount of ground truth data and the general dynamics of infectious diseases present unique challenges when evaluating model performance. In this paper, we look at several methods that have been used to assess influenza prevalence using Twitter. We then validate them with tests that are designed to avoid and illustrate issues with the standard k-fold cross validation method. We also find that small modifications to the way that data are partitioned can have major effects on a model's reported performance.

[1]  Alberto Maria Segre,et al.  The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic , 2011, PloS one.

[2]  Jeremy Ginsberg,et al.  Detecting influenza epidemics using search engine query data , 2009, Nature.

[3]  Stephen R. Marsland,et al.  Machine Learning - An Algorithmic Perspective , 2009, Chapman and Hall / CRC machine learning and pattern recognition series.

[4]  David M. Pennock,et al.  Predicting consumer behavior with Web search , 2010, Proceedings of the National Academy of Sciences.

[5]  Connie St Louis,et al.  Can Twitter predict disease outbreaks? , 2012, BMJ : British Medical Journal.

[6]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[7]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[8]  Aron Culotta,et al.  Towards detecting influenza epidemics by analyzing Twitter messages , 2010, SOMA '10.

[9]  Declan Butler,et al.  When Google got flu wrong , 2013, Nature.

[10]  David Schlossberg,et al.  Clinical Infectious Disease: Clinical Syndromes – Respiratory Tract , 2008 .

[11]  C. Hawn Take two aspirin and tweet me in the morning: how Twitter, Facebook, and other social media are reshaping health care. , 2009, Health affairs.

[12]  Caroline O. Buckee,et al.  Digital Epidemiology , 2012, PLoS Comput. Biol..

[13]  Míriam Antón-Rodríguez,et al.  A content analysis of chronic diseases social groups on Facebook and Twitter. , 2012, Telemedicine journal and e-health : the official journal of the American Telemedicine Association.

[14]  Alexander J. Smola,et al.  Support Vector Regression Machines , 1996, NIPS.

[15]  Marcel Salathé,et al.  Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control , 2011, PLoS Comput. Biol..

[16]  Eleftherios Mylonakis,et al.  Google trends: a web-based tool for real-time surveillance of disease outbreaks. , 2009, Clinical infectious diseases : an official publication of the Infectious Diseases Society of America.

[17]  A. Dugas,et al.  Google Flu Trends: correlation with emergency department influenza rates and crowding metrics. , 2011, Clinical infectious diseases : an official publication of the Infectious Diseases Society of America.