Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs

Event detection in social media is an important but challenging problem. Most existing approaches are based on burst detection, topic modeling, or clustering techniques, which cannot naturally model the implicit heterogeneous network structure in social media. As a result, only limited information, such as terms and geographic locations, can be used. This paper presents Non-Parametric Heterogeneous Graph Scan (NPHGS), a new approach that considers the entire heterogeneous network for event detection. We first model the network as a "sensor" network, in which each node senses its "neighborhood environment" and reports an empirical p-value measuring its current level of anomalousness for each time interval (e.g., hour or day). Then, we efficiently maximize a nonparametric scan statistic over connected subgraphs to identify the most anomalous network clusters. Finally, the event represented by each cluster is summarized with information such as the type of event, geographical locations, time, and participants. As case studies, we consider two applications using Twitter data, civil unrest event detection and rare disease outbreak detection, and present empirical evaluations illustrating the effectiveness and efficiency of the proposed approach.
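
The first two steps of the pipeline can be illustrated with a minimal sketch. It is not the NPHGS implementation: the abstract does not say which nonparametric scan statistic is used, so the Berk-Jones form N * KL(N_alpha/N, alpha) is assumed here as one common choice. The function names, the add-one smoothing of the p-value, the `alpha_max` cutoff of 0.15, and the naive greedy growth of a connected subgraph are likewise assumptions made for illustration.

```python
import math


def empirical_p_value(current, history):
    """Share of historical observations at least as large as the current one.

    Add-one smoothing (an assumption) keeps the p-value strictly positive.
    """
    exceed = sum(1 for h in history if h >= current)
    return (exceed + 1) / (len(history) + 1)


def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), with 0*log(0) = 0."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total


def berk_jones_score(p_values, alpha_max=0.15):
    """Assumed scan statistic: max over alpha of N * KL(N_alpha / N, alpha).

    Candidate thresholds are the observed p-values that do not exceed alpha_max.
    """
    n = len(p_values)
    best = 0.0
    for alpha in sorted({p for p in p_values if 0 < p <= alpha_max}):
        n_alpha = sum(1 for p in p_values if p <= alpha)
        frac = n_alpha / n
        # Only an excess of significant p-values counts as anomalous.
        if frac > alpha:
            best = max(best, n * kl_bernoulli(frac, alpha))
    return best


def greedy_connected_subgraph(adjacency, p_values, seed, alpha_max=0.15):
    """Naive heuristic: grow a connected node set from a seed while the score rises."""
    subset = {seed}
    score = berk_jones_score([p_values[v] for v in subset], alpha_max)
    while True:
        # Nodes adjacent to the current subset keep it connected when added.
        frontier = {u for v in subset for u in adjacency[v]} - subset
        best_node, best_score = None, score
        for u in sorted(frontier):
            candidate = [p_values[v] for v in subset | {u}]
            s = berk_jones_score(candidate, alpha_max)
            if s > best_score:
                best_node, best_score = u, s
        if best_node is None:
            return subset, score
        subset.add(best_node)
        score = best_score


# Hypothetical toy graph: a path a - b - c - d with one p-value per node.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
p_values = {"a": 0.01, "b": 0.02, "c": 0.9, "d": 0.01}
cluster, cluster_score = greedy_connected_subgraph(adjacency, p_values, "a")
```

In this toy graph the growth from node `a` stops at `{a, b}`, because adding the non-anomalous node `c` lowers the score. A greedy expansion of this kind gives no optimality guarantee and is only meant to make the phrase "maximize over connected subgraphs" concrete.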
