Reappraising the utility of Google Flu Trends

Estimating influenza-like illness (ILI) from search trends activity was intended to supplement traditional surveillance systems and motivated the development of Google Flu Trends (GFT). However, several studies have reported large errors in GFT estimates of ILI in the US. Following the recent release of time-stamped surveillance data, which better reflect real-time operational scenarios, we reanalyzed GFT errors. We used three data sources: GFT, an archive of weekly ILI estimates from Google Flu Trends; ILIf, fully observed ILI rates from ILINet; and ILIp, ILI rates available in real time based on partial reporting. Five influenza seasons were analyzed, and the mean squared errors (MSE) of GFT and ILIp as estimates of ILIf were computed. To correct GFT errors, a random forest regression model was built with ILI and GFT rates from the previous three weeks as predictors. The correction reduced error by 44% overall, and the errors of the corrected GFT were lower than those of ILIp. An 80% reduction in error during 2012/13, when GFT had large errors, shows that extreme failures of GFT could have been avoided. Using autoregressive integrated moving average (ARIMA) models, one- to four-week-ahead forecasts were generated from two separate data streams: ILIp alone, and ILIp together with corrected GFT. At all forecast targets and seasons, and for all but two regions, inclusion of GFT lowered MSE. Results from two alternative error measures, mean absolute error and mean absolute proportional error, were largely consistent with results from MSE. Taken together, these findings provide an error profile of GFT in the US, establish strong evidence for the adoption of search-trends-based 'nowcasts' in influenza forecast systems, and encourage reevaluation of the utility of this data source in diverse domains.
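
To make the error measures and the lagged-predictor setup concrete, the following is a minimal Python sketch, not the authors' implementation. The function names, the toy numbers, and the choice of target (the ILI rate in the current week) are assumptions made for illustration; the abstract does not specify which ILI series feeds the predictors, and the random forest fit itself is omitted.

```python
from statistics import fmean


def mse(estimates, truth):
    """Mean squared error of estimates against the reference series."""
    return fmean((e - t) ** 2 for e, t in zip(estimates, truth))


def mae(estimates, truth):
    """Mean absolute error."""
    return fmean(abs(e - t) for e, t in zip(estimates, truth))


def mape(estimates, truth):
    """Mean absolute proportional error; assumes every truth value is nonzero."""
    return fmean(abs(e - t) / t for e, t in zip(estimates, truth))


def percent_reduction(baseline_error, new_error):
    """Relative reduction in error, expressed as a percentage of the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


def lagged_rows(ili, gft, n_lags=3):
    """Pair each week's ILI rate with ILI and GFT rates from the previous n_lags weeks.

    Returns a list of (features, target) tuples. Features are ordered
    [ili(t-1), ..., ili(t-n_lags), gft(t-1), ..., gft(t-n_lags)].
    The target choice (ili at week t) is an assumption of this sketch.
    """
    rows = []
    for t in range(n_lags, len(ili)):
        features = [ili[t - k] for k in range(1, n_lags + 1)]
        features += [gft[t - k] for k in range(1, n_lags + 1)]
        rows.append((features, ili[t]))
    return rows


# Hypothetical toy series (percent ILI per week), not data from the study.
ili_f = [2.0, 4.0, 5.0, 3.0]   # fully observed rates
ili_p = [2.0, 3.0, 5.0, 3.0]   # partially reported, real-time rates
gft = [3.0, 6.0, 5.0, 2.0]     # search-based estimates

gft_mse = mse(gft, ili_f)      # error of GFT as an estimate of ILIf
ilip_mse = mse(ili_p, ili_f)   # error of ILIp as an estimate of ILIf
training_rows = lagged_rows(ili_f, gft)
```

Rows such as `training_rows` could then be passed to any regression learner, and `percent_reduction` applied to the MSE before and after correction gives a figure comparable in form to the 44% and 80% reductions reported above.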
