On the ground validation of online diagnosis with Twitter and medical records

Social media has been considered as a data source for tracking disease. However, most analyses rely on models that prioritize strong correlation with population-level disease rates over determining whether specific individual users are actually sick. Taking a different approach, we develop a novel system for social-media-based disease detection at the individual level, using a sample of professionally diagnosed individuals. Specifically, the system makes an accurate influenza diagnosis from an individual's publicly available Twitter data. We find that about half (17/35 = 48.57%) of the users in our sample who were sick explicitly discuss their disease on Twitter. By developing a meta classifier that combines text analysis, anomaly detection, and social network analysis, we are able to diagnose an individual with greater than 99% accuracy even if that individual does not discuss their health.
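The abstract does not say how the meta classifier combines its three components. As a rough illustration only, the sketch below assumes a stacking setup in which each component (text analysis, anomaly detection, social network analysis) emits a score between 0 and 1 and a small logistic regression learns how to weigh them. The function names, the choice of logistic regression, and all scores are hypothetical and are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_meta(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights over base-classifier scores (plain SGD)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

def predict(weights, bias, x):
    """Return True if the combined evidence suggests the user is sick."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5

# Made-up scores per user: (text_score, anomaly_score, network_score).
# Several "sick" users have a low text score, i.e. they do not mention illness.
TRAIN_X = [
    (0.9, 0.2, 0.1), (0.1, 0.8, 0.3), (0.0, 0.7, 0.9), (0.2, 0.1, 0.9),  # sick
    (0.1, 0.1, 0.2), (0.0, 0.2, 0.1), (0.2, 0.0, 0.0), (0.1, 0.2, 0.2),  # healthy
]
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_meta(TRAIN_X, TRAIN_Y)
predictions = [predict(weights, bias, x) for x in TRAIN_X]
```

The point of the design is that no single signal has to carry the diagnosis: a user whose tweets never mention the flu can still be flagged when the anomaly and network scores are high.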
