SPOTHOT: Scalable Detection of Geo-spatial Events in Large Textual Streams

The analysis of social media data poses several challenges: first, the data sets are very large; second, they change constantly; and third, they are heterogeneous, consisting of text, images, geographic locations, and social connections. In this article, we focus on detecting events consisting of text and location information, and introduce an analysis method that is scalable with respect to both volume and velocity. Our event detection also addresses the problems arising from differences in the adoption of social media across cultures, languages, and countries by means of efficient normalization. We introduce an algorithm capable of processing vast amounts of data using a scalable online approach based on the SigniTrend event detection system. It identifies unusual geo-textual patterns in the data stream without requiring the user to specify any constraints in advance, such as hashtags to track. In contrast to earlier work, we are able to monitor every word at every location with just a fixed amount of memory, compare the values to statistics from earlier data, and immediately report significant deviations with minimal delay. Thus, this algorithm is capable of reporting "Breaking News" in real time. Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at city, regional, and global levels at the same time. The usefulness of the approach is demonstrated in several real-world use cases based on Twitter data.
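The idea of monitoring every word at every location with a fixed amount of memory can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a plain latitude/longitude grid as the geometric discretization, hashed tables of exponentially weighted moving averages and variances as the statistics from earlier data, and a z-score threshold as the significance test. All names and parameter values (`grid_cell`, `epoch_frequencies`, `HashedTrendDetector`, `alpha`, `bias`, `threshold`) are hypothetical choices made for illustration only.

```python
import hashlib
import math
from collections import Counter


def grid_cell(lat, lon, size_deg=1.0):
    """Map a coordinate to a grid cell (assumed discretization, not the paper's)."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def epoch_frequencies(docs, size_deg=1.0):
    """Fraction of documents in one epoch that contain each (word, cell) key.

    docs is a list of (tokens, lat, lon) tuples. Dividing by the epoch size
    is a simple normalization against changes in overall volume.
    """
    counts = Counter()
    for tokens, lat, lon in docs:
        cell = grid_cell(lat, lon, size_deg)
        for word in set(tokens):
            counts[f"{word}@{cell[0]},{cell[1]}"] += 1
    total = max(len(docs), 1)
    return {key: n / total for key, n in counts.items()}


class HashedTrendDetector:
    """Fixed-memory moving mean and variance per hashed (word, cell) key."""

    def __init__(self, num_buckets=4096, num_hashes=2,
                 alpha=0.1, bias=0.01, threshold=3.0):
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.alpha = alpha          # learning rate of the moving statistics
        self.bias = bias            # damps scores of very rare keys
        self.threshold = threshold  # minimum score that counts as an event
        self.mean = [[0.0] * num_buckets for _ in range(num_hashes)]
        self.var = [[0.0] * num_buckets for _ in range(num_hashes)]

    def _bucket(self, table, key):
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                                 salt=bytes([table])).digest()
        return int.from_bytes(digest, "big") % self.num_buckets

    def process_epoch(self, freqs):
        """Score one epoch against past statistics, then update them."""
        alerts = []
        observed = [dict() for _ in range(self.num_hashes)]
        for key, x in freqs.items():
            scores = []
            for t in range(self.num_hashes):
                b = self._bucket(t, key)
                std = math.sqrt(self.var[t][b])
                scores.append((x - self.mean[t][b]) / (std + self.bias))
                # On a collision the bucket tracks the most frequent key.
                observed[t][b] = max(observed[t].get(b, 0.0), x)
            score = min(scores)  # conservative: collisions only lower the score
            if score >= self.threshold:
                alerts.append((key, score))
        for t in range(self.num_hashes):
            for b in range(self.num_buckets):
                x = observed[t].get(b, 0.0)  # unseen buckets decay toward zero
                delta = x - self.mean[t][b]
                self.mean[t][b] += self.alpha * delta
                self.var[t][b] = (1 - self.alpha) * (
                    self.var[t][b] + self.alpha * delta * delta)
        return sorted(alerts, key=lambda kv: -kv[1])
```

Memory use depends only on `num_buckets` and `num_hashes`, not on the number of distinct words or locations seen. Each key is scored before the tables are updated, so a burst is compared only against earlier epochs. This sketch covers a single grid resolution; detecting events at several levels at once would require additional keys per document, for example coarser cells or administrative regions.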
