A general method for estimating the prevalence of influenza-like-symptoms with Wikipedia data

Influenza is an acute, seasonal respiratory disease that affects millions of people worldwide and causes thousands of deaths in Europe alone. Estimating the impact of an illness on a given country quickly and reliably is essential for planning and organizing effective countermeasures, and unconventional data sources such as web searches and page visits now make this possible. In this study, we show that page-view counts for a selected group of Wikipedia articles, combined with machine learning models, yield accurate estimates of influenza-like illness (ILI) incidence in four European countries: Italy, Germany, Belgium, and the Netherlands. We propose a novel language-agnostic method, based on two algorithms, Personalized PageRank and CycleRank, that automatically selects the most relevant Wikipedia pages to monitor without expert supervision. We then compare our model with previous solutions and show that it reaches state-of-the-art results.
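To make the page-selection idea concrete, the following is a minimal sketch of Personalized PageRank computed by power iteration on a toy link graph. It is not the authors' implementation: the graph, the page names, the seed article, the damping factor `alpha = 0.85`, and the function name `personalized_pagerank` are all assumptions chosen for illustration. The only change from ordinary PageRank is that the random surfer always teleports back to the seed article, so scores measure relevance to that article rather than global popularity.

```python
def personalized_pagerank(links, seed, alpha=0.85, iterations=100):
    """Score pages by relevance to `seed` using power iteration.

    links: dict mapping a page title to the list of pages it links to.
    seed:  page that receives all of the teleportation probability.
    alpha: damping factor (assumed value, not taken from the paper).
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts} | {seed})
    # Start with all probability mass on the seed page.
    scores = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        new = {n: 0.0 for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Spread the damped score evenly over outgoing links.
                share = alpha * scores[node] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # Dangling page: send its damped score back to the seed.
                new[seed] += alpha * scores[node]
        # Teleportation step: always jump back to the seed page.
        new[seed] += 1.0 - alpha
        scores = new
    return scores


# Hypothetical toy link graph; real input would be a Wikipedia link dump.
toy_links = {
    "Influenza": ["Fever", "Influenza vaccine"],
    "Fever": ["Influenza"],
    "Influenza vaccine": ["Influenza"],
    "Football": ["Fever"],
}

scores = personalized_pagerank(toy_links, seed="Influenza")
# Keep the highest-scoring pages as candidates for page-view monitoring.
top_pages = sorted(scores, key=scores.get, reverse=True)[:3]
```

In this toy graph, "Football" links into the flu-related pages but is never reached from the seed, so it receives a score of zero and is not selected. Because the computation needs only the link structure, the same procedure can be run unchanged on any language edition, which is what makes this kind of selection language-agnostic.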
