How to Exploit Twitter for Public Health Monitoring?

OBJECTIVES Detecting indications of public health threats as early as possible is crucial to prevent harm to the population. However, many disease surveillance strategies rely on data whose collection requires explicit reporting (data transmitted from hospitals, laboratories, or physicians). Collecting these reports takes time, which delays the response. Moreover, context information on individual cases is often lost in the collection process. This paper describes a system that tries to address these limitations by processing social media to identify information on public health threats. The primary objective is to study the usefulness of this approach for supporting the monitoring of a population's health status.

METHODS The developed system works in three main steps: Data from Twitter, blogs, and forums as well as from TV and radio channels are continuously collected and filtered by means of keyword lists. Sentences of relevant texts are classified as relevant or irrelevant using a binary classifier based on support vector machines. Using statistical methods known from biosurveillance, the relevant sentences are analyzed further, and signals are generated automatically when unexpected behavior is detected. From the generated signals, a subset is selected for presentation to a user by matching against user queries or profiles. In a set of evaluation experiments, public health experts assessed the generated signals with respect to correctness and relevance. In particular, the experiments assessed how many relevant and irrelevant signals are generated during a specific time period.

RESULTS The experiments show that the system provides information on health events identified in social media. Signals are mainly generated from Twitter messages posted by news agencies. Personal tweets, i.e. tweets from persons observing some symptoms, play only a minor role in signal generation because the volume of relevant messages is limited. Relevant signals referring to real-world outbreaks were generated by the system and monitored by epidemiologists, for example during the European football championship. However, the number of relevant signals among the generated signals is still very small: the different experiments yielded a proportion between 5 and 20% of signals regarded as "relevant" by the users. Vaccination or education campaigns communicated via Twitter, as well as the use of medical terms in contexts other than outbreak reporting, led to the generation of irrelevant signals.

CONCLUSIONS The aggregation of information into signals reduces the monitoring effort compared to other existing systems. Contrary to expectations, only a few messages are of a personal nature, reporting on personal symptoms. Instead, media reports are distributed over social media channels. Despite the high percentage of irrelevant signals generated by the system, the users reported that monitoring aggregated information in the form of signals is less demanding than manually monitoring huge social media data streams. Developing strategies for reducing false alarms remains future work.
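The signal generation step described under METHODS can be illustrated with a small sketch. The abstract only states that "statistical methods known from biosurveillance" are applied, so the choice of a one-sided CUSUM detector here is an assumption made for illustration, not necessarily the algorithm the system uses. The function name `cusum_signals`, the parameters `baseline_len`, `k`, and `h`, and the daily counts are all hypothetical.

```python
from statistics import mean, stdev


def cusum_signals(counts, baseline_len=7, k=0.5, h=4.0):
    """Return the indices of days on which a one-sided CUSUM raises a signal.

    counts       -- daily number of sentences classified as relevant (hypothetical input)
    baseline_len -- number of leading days used to estimate the expected level
    k            -- allowance, in standard deviations, tolerated before accumulating
    h            -- decision threshold on the cumulative sum
    """
    baseline = counts[:baseline_len]
    mu = mean(baseline)
    # Fall back to 1.0 when the baseline is perfectly flat (stdev == 0).
    sigma = stdev(baseline) or 1.0

    s = 0.0
    signals = []
    for day in range(baseline_len, len(counts)):
        z = (counts[day] - mu) / sigma
        # Accumulate only upward deviations beyond the allowance k.
        s = max(0.0, s + z - k)
        if s > h:
            signals.append(day)
            s = 0.0  # restart accumulation after raising a signal
    return signals


# Made-up daily counts with a burst on days 9 and 10.
daily_counts = [3, 4, 2, 5, 3, 4, 3, 4, 3, 12, 15, 4, 3]
print(cusum_signals(daily_counts))
```

In a setup like this, raising `h` or `k` would trade sensitivity for fewer false alarms, which is the kind of tuning the conclusions leave to future work.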
