The Parable of Google Flu: Traps in Big Data Analysis

Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data. In February 2013, Google Flu Trends (GFT) made headlines, but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) estimated by the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened even though GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?

[1]  Gail A. Herndon The chronicle of higher education , 1977 .

[2]  Richard Ashley Inflation and the Distribution of Price Changes across Markets: A Causal Analysis , 1981 .

[3]  F. Diebold,et al.  Comparing Predictive Accuracy , 1994, Business Cycles.

[4]  Robert J-P. Hauck Oh Monsieur Pasteur, We Hardly Knew You! , 1995 .

[5]  D. Cohen Chronicle of Higher Education , 1998 .

[6]  R. Platt,et al.  Using automated medical records for rapid identification of illness syndromes (syndromic surveillance): the example of lower respiratory infection , 2001, BMC public health.

[7]  Richard Schmalensee,et al.  Advertising and aggregate consumption: an analysis of causality , 1980 .

[8]  Richard A. Ashley,et al.  Statistically significant forecasting improvements: how much out-of-sample data is likely necessary? , 2003 .

[9]  Cécile Viboud,et al.  Prediction of the spread of influenza epidemics by the method of analogues. , 2003, American journal of epidemiology.

[10]  宁北芳,et al.  Changes in var gene switching rates of malaria parasites lead to antigenic variation [in English] / Paul H, Robert P, Christodoulou Z, et al // Proc Natl Acad Sci U S A , 2005 .

[11]  S. Leach,et al.  Real-time epidemic forecasting for pandemic influenza , 2006, Epidemiology and Infection.

[12]  W. Thompson,et al.  Epidemiology of seasonal influenza: use of surveillance data and statistical models to estimate the burden of disease. , 2006, The Journal of infectious diseases.

[13]  Alessandro Vespignani,et al.  Multiscale mobility networks and the spatial spreading of infectious diseases , 2009, Proceedings of the National Academy of Sciences.

[14]  Jeremy Ginsberg,et al.  Detecting influenza epidemics using search engine query data , 2009, Nature.

[15]  R. Rosenfeld Nature , 2009, Otolaryngology--head and neck surgery : official journal of American Academy of Otolaryngology-Head and Neck Surgery.

[16]  A. Pentland,et al.  Computational Social Science , 2009, Science.

[17]  A. Vespignani Predicting the Behavior of Techno-Social Systems , 2009, Science.

[18]  A. Cook,et al.  Real-Time Epidemic Monitoring and Forecasting of H1N1-2009 Using Influenza-Like Illness from General Practice and Family Doctor Clinics in Singapore , 2010, PloS one.

[19]  David M. Pennock,et al.  Predicting consumer behavior with Web search , 2010, Proceedings of the National Academy of Sciences.

[20]  Dennis L. Chao,et al.  FluTE, a Publicly Available Stochastic Influenza Epidemic Simulation Model , 2010, PLoS Comput. Biol..

[21]  C. Goss,et al.  Monitoring Influenza Activity in the United States: A Comparison of Traditional Surveillance Systems with Google Flu Trends , 2011, PloS one.

[22]  Panagiotis Takis Metaxas,et al.  How (Not) to Predict Elections , 2011, 2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing.

[23]  G. King,et al.  Ensuring the Data-Rich Future of the Social Sciences , 2011, Science.

[24]  Johan Bollen,et al.  Twitter mood predicts the stock market , 2010, J. Comput. Sci..

[25]  Matthew Mohebbi,et al.  Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic , 2011, PloS one.

[26]  D. Boyd,et al.  CRITICAL QUESTIONS FOR BIG DATA , 2012 .

[27]  Alessandro Vespignani,et al.  Beating the news using social media: the case study of American Idol , 2012, EPJ Data Science.

[28]  J. Paul BMC public health. , 2012, World health & population.

[29]  Erik Brynjolfsson,et al.  Big data: the management revolution. , 2012, Harvard business review.

[30]  J. Shaman,et al.  Forecasting seasonal outbreaks of influenza , 2012, Proceedings of the National Academy of Sciences.

[31]  Adam J. Berinsky,et al.  Evaluating Online Labor Markets for Experimental Research: Amazon.com's Mechanical Turk , 2012, Political Analysis.

[32]  Camille Pelat,et al.  A Method to Assess Seasonality of Urinary Tract Infections Based on Medication Sales and Google Trends , 2013, PloS one.

[33]  Fabian M. Suchanek,et al.  Proceedings of the 22nd International World Wide Web Conference, WWW'13 , 2013, WWW 2013.

[34]  Declan Butler,et al.  When Google got flu wrong , 2013, Nature.

[35]  E. Nsoesie,et al.  A Simulation Optimization Approach to Epidemic Forecasting , 2013, PloS one.

[36]  M. Smolinski,et al.  Flu Near You: An Online Self-reported Influenza Surveillance System in the USA , 2013, Online Journal of Public Health Informatics.

[37]  Cécile Viboud,et al.  Reassessing Google Flu Trends Data for Detection of Seasonal and Pandemic Influenza: A Comparative Epidemiological Study at Three Geographic Scales , 2013, PLoS Comput. Biol..

[38]  Alicia Karspeck,et al.  Real-Time Influenza Forecasts during the 2012–2013 Season , 2013, Nature Communications.