An Evolutionary Methodology for Handling Data Scarcity and Noise in Monitoring Real Events from Social Media Data

Every day, text-based social media channels are flooded with millions of messages covering a wide range of topics. These channels are used as a rich data source for monitoring real-world events such as natural disasters and disease outbreaks. However, depending on the event being investigated, this monitoring may be severely affected by data scarcity and noise, allowing only coarse-grained analysis in time and space, which lacks the specificity needed to support actions at the local level. In this context, we present a methodology for handling data scarcity and noise while monitoring real-world events at a fine granularity using social media data. We apply our methodology to dengue-related data from Brazil and show that it can significantly improve the performance of event monitoring at a local scale, almost doubling the observed correlation in some cases.
