Global Disease Monitoring and Forecasting with Wikipedia

Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrated that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
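As a rough illustration of the kind of analysis summarized above, the following minimal sketch fits an ordinary least-squares linear model that maps article view counts on one day to case counts a fixed number of days later, and reports r². It is not the authors' implementation: the function name `fit_lagged_linear_model`, the `lag_days` parameter, and all of the data are hypothetical, and the synthetic series exist only to make the example runnable.

```python
import numpy as np


def fit_lagged_linear_model(views, cases, lag_days=0):
    """Fit cases[t + lag_days] ~ intercept + views[t] by least squares.

    `views` may be 1-D (one article) or 2-D (days x articles).
    Returns (coefficients, r_squared). Hypothetical helper for illustration.
    """
    views = np.asarray(views, dtype=float)
    cases = np.asarray(cases, dtype=float)
    if views.ndim == 1:
        views = views[:, None]
    if not 0 <= lag_days < len(cases):
        raise ValueError("lag_days must be between 0 and len(cases) - 1")

    n = len(cases) - lag_days            # number of usable (views, cases) pairs
    X = np.column_stack([np.ones(n), views[:n]])
    y = cases[lag_days:]                 # cases shifted forward by the lag

    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefs
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coefs, 1.0 - ss_res / ss_tot


# Synthetic data (an assumption, not real Wikipedia logs or case reports):
# case counts follow the view signal with a 7-day delay, plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
clean = 1000.0 + 400.0 * np.sin(2.0 * np.pi * days / 90.0)
views = clean + rng.normal(0.0, 50.0, size=days.size)
cases = np.concatenate([np.full(7, 50.0), 0.05 * clean[:-7]])
cases = cases + rng.normal(0.0, 2.0, size=days.size)

for lag in (0, 7, 14, 28):
    _, r2 = fit_lagged_linear_model(views, cases, lag_days=lag)
    print(f"lag {lag:2d} days: r^2 = {r2:.3f}")
```

Comparing r² across several values of `lag_days` is one simple way to probe how far ahead the view signal carries information. A multi-article model can be tried by passing a 2-D `views` array with one column per article.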
