Subregional Nowcasts of Seasonal Influenza Using Search Trends

Background: Limiting the adverse effects of seasonal influenza outbreaks at the state or city level requires close monitoring of localized outbreaks and reliable forecasts of their progression. Although forecasting models for influenza or influenza-like illness (ILI) are becoming increasingly available, their applicability to localized outbreaks is limited by the lack of real-time observations of the current outbreak state at local scales. Surveillance data collected by health departments are widely accepted as the reference standard for estimating the state of outbreaks; in the absence of such data, nowcast proxies built from Web-based activity such as search engine queries, tweets, and visits to health-related webpages can be useful. Nowcast estimates of state and municipal ILI were previously published by Google Flu Trends (GFT); however, validations of these estimates were seldom reported.

Objective: The aim of this study was to develop and validate models to nowcast ILI at subregional geographic scales.

Methods: We built nowcast models based on autoregressive (autoregressive integrated moving average; ARIMA) and supervised regression (random forests) methods at the US state level using regional weighted ILI and Web-based search activity derived from Google's Extended Trends application programming interface. We validated the performance of these methods using actual surveillance data for the 50 states across six seasons. We also built state-level nowcast models using state-level estimates of ILI and compared the accuracy of these estimates with the estimates of the regional models extrapolated to the state level and with the nowcast estimates published by GFT.

Results: Models built using regional ILI extrapolated to the state level had a median correlation of 0.84 (interquartile range [IQR]: 0.74-0.91) and a median root mean square error (RMSE) of 1.01 (IQR: 0.74-1.50), with noticeable variability across seasons and by state population size. Model forms that hypothesize the availability of timely state-level surveillance data show significantly lower errors of 0.83 (0.55-0.23). Compared with GFT, the latter model forms have lower errors but also lower correlation.

Conclusions: These results suggest that the proposed methods may be an alternative to the discontinued GFT and that further improvements in the quality of subregional nowcasts may require increased access to more finely resolved surveillance data.
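The two accuracy measures reported above, correlation and RMSE between nowcast and surveillance ILI, can be illustrated with a minimal sketch. It assumes the correlation is Pearson's r, which the abstract does not specify. The function names and the weekly ILI values are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length series."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def pearson_r(x, y):
    """Pearson correlation coefficient (assumed here; the abstract does not name the type)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical weekly ILI percentages for one state and one season.
observed_ili = [1.0, 2.0, 3.0, 4.0]
nowcast_ili = [1.5, 2.5, 3.5, 4.5]

print("RMSE:", rmse(observed_ili, nowcast_ili))
print("Correlation:", pearson_r(observed_ili, nowcast_ili))
```

In this toy series the nowcast tracks the shape of the observed curve exactly but carries a constant offset, so the correlation is high while the RMSE is nonzero. This is one way a model can rank better on one measure and worse on the other, as in the comparison with GFT.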
