Analysing Twitter and web queries for flu trend prediction

BackgroundSocial media platforms encourage people to share diverse aspects of their daily life. Among these, shared health related information might be used to infer health status and incidence rates for specific conditions or symptoms. In this work, we present an infodemiology study that evaluates the use of Twitter messages and search engine query logs to estimate and predict the incidence rate of influenza like illness in Portugal.ResultsBased on a manually classified dataset of 2704 tweets from Portugal, we selected a set of 650 textual features to train a Naïve Bayes classifier to identify tweets mentioning flu or flu-like illness or symptoms. We obtained a precision of 0.78 and an F-measure of 0.83, based on cross validation over the complete annotated set. Furthermore, we trained a multiple linear regression model to estimate the health-monitoring data from the Influenzanet project, using as predictors the relative frequencies obtained from the tweet classification results and from query logs, and achieved a correlation ratio of 0.89 (p < 0.001). These classification and regression models were also applied to estimate the flu incidence in the following flu season, achieving a correlation of 0.72.ConclusionsPrevious studies addressing the estimation of disease incidence based on user-generated content have mostly focused on the english language. Our results further validate those studies and show that by changing the initial steps of data preprocessing and feature extraction and selection, the proposed approaches can be adapted to other languages. Additionally, we investigated whether the predictive model created can be applied to data from the subsequent flu season. In this case, although the prediction result was good, an initial phase to adapt the regression model could be necessary to achieve more robust results.

[1]  E. Larson,et al.  Dissemination of health information through social networks: twitter and antibiotics. , 2010, American journal of infection control.

[2]  Gunther Eysenbach,et al.  Infodemiology: Tracking Flu-Related Searches on the Web for Syndromic Surveillance , 2006, AMIA.

[3]  J. Brownstein,et al.  Social and news media enable estimation of epidemiological patterns early in the 2010 Haitian cholera outbreak. , 2012, The American journal of tropical medicine and hygiene.

[4]  Mizuki Morita,et al.  Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter , 2011, EMNLP.

[6]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[7]  G. Eysenbach Infodemiology and Infoveillance: Framework for an Emerging Set of Public Health Informatics Methods to Analyze Search, Communication and Publication Behavior on the Internet , 2009, Journal of medical Internet research.

[8]  A Lyon,et al.  Comparison of web-based biosecurity intelligence systems: BioCaster, EpiSPIDER and HealthMap. , 2012, Transboundary and emerging diseases.

[9]  Mark Dredze,et al.  You Are What You Tweet: Analyzing Twitter for Public Health , 2011, ICWSM.

[10]  Laura Schweitzer,et al.  Advances In Kernel Methods Support Vector Learning , 2016 .

[11]  Alberto Maria Segre,et al.  The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic , 2011, PloS one.

[12]  Jeremy Ginsberg,et al.  Detecting influenza epidemics using search engine query data , 2009, Nature.

[13]  T. Bernardo,et al.  Scoping Review on Search Queries and Social Media for Disease Surveillance: A Chronology of Innovation , 2013, Journal of medical Internet research.

[14]  Karan P. Singh,et al.  Theoretical Biology and Medical Modelling , 2007 .

[15]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[16]  Ewan Klein,et al.  Natural Language Processing with Python , 2009 .

[17]  Aron Culotta,et al.  Detecting influenza outbreaks by analyzing Twitter messages , 2010, ArXiv.

[18]  Aron Culotta,et al.  Towards detecting influenza epidemics by analyzing Twitter messages , 2010, SOMA '10.

[19]  Nello Cristianini,et al.  Tracking the flu pandemic by monitoring the social web , 2010, 2010 2nd International Workshop on Cognitive Information Processing.

[20]  Alok N. Choudhary,et al.  Real-time disease surveillance using Twitter data: demonstration on flu and cancer , 2013, KDD.

[21]  G. Eysenbach,et al.  Pandemics in the Age of Twitter: Content Analysis of Tweets during the 2009 H1N1 Outbreak , 2010, PloS one.