Deceptiveness of internet data for disease surveillance

Quantifying how many people are or will be sick, and where, is a critical ingredient in reducing the burden of disease because it helps the public health system plan and implement effective outbreak response. This process of disease surveillance currently relies on data gathered using clinical and laboratory methods; the distributed human contact and bureaucratic data aggregation involved make these procedures expensive and cause them to lag real time by weeks or months. The promise of new surveillance approaches using internet data, such as web event logs or social media messages, is to achieve the same goal faster and more cheaply. However, prior work in this area lacks a rigorous model of information flow, making it difficult to assess the reliability of both specific approaches and the body of work as a whole. We model disease surveillance as a Shannon communication. This new framework lets any two disease surveillance approaches be compared using a unified vocabulary and conceptual model. Using it, we describe and compare the deficiencies suffered by traditional and internet-based surveillance, introduce a new risk metric called deceptiveness, and offer mitigations for some of these deficiencies. The framework also makes the rich tools of information theory applicable to disease surveillance. This better understanding will improve the decision-making of public health practitioners by helping them leverage internet-based surveillance in a way that complements the strengths of traditional surveillance.
