Using search engine big data for predicting new HIV diagnoses

Background: A large and growing body of "big data" is generated by internet search engines, such as Google. Because people often search for information about public health and medical issues, researchers may be able to use search engine data to monitor and predict public health problems, such as HIV. We sought to assess the feasibility of using Google search data to analyze and predict new HIV diagnoses in the United States.

Methods and findings: We collected search volume data on HIV-related Google search keywords across the United States from 2007 to 2014. State-level data on new HIV diagnoses were collected from the Centers for Disease Control and Prevention (CDC) and AIDSVu.org, and were combined with the Google search data. We developed a negative binomial model to predict HIV cases using a subset of significant predictor keywords identified by LASSO. We used historical data to train the model and predicted new HIV diagnoses from 2011 to 2014, with an average R² of 0.99 between predicted and actual cases and an average root-mean-square error (RMSE) of 108.75.

Conclusions: The results indicate that Google Trends is a feasible tool for predicting new cases of HIV at the state level. We discuss the implications of integrating visualization maps and tools based on these models into public health and HIV monitoring and surveillance.
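The two accuracy figures reported above (R² and RMSE between predicted and actual cases) can be illustrated with a minimal sketch. The abstract does not say how R² was computed; the code below assumes it is the squared Pearson correlation between predicted and actual counts. The function names and the state-level counts are hypothetical and used only for illustration; they are not data from the study.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences of counts."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Squared Pearson correlation (an assumed definition of R^2 here)."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("R^2 is undefined when either input is constant")
    return cov ** 2 / (var_a * var_p)


# Hypothetical yearly counts of new diagnoses for four states (made-up numbers).
actual_cases = [120, 450, 980, 2300]
predicted_cases = [130, 430, 1010, 2250]

print("RMSE:", rmse(actual_cases, predicted_cases))
print("R^2:", r_squared(actual_cases, predicted_cases))
```

RMSE is expressed in the same units as the outcome (cases), so it should be read relative to the size of state-level counts, whereas R² is unitless and can be high even when absolute errors are large for populous states.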
