Efficient temporal mining of micro-blog texts and its application to event discovery

In this paper we present a novel method for clustering words in micro-blogs, based on the similarity of their time series. Our technique, named SAX*, uses the Symbolic Aggregate ApproXimation algorithm to discretize the time series of each term into a small set of levels, yielding one string per term. We then define a subset of “interesting” strings, i.e. those representing patterns of collective attention. Sliding temporal windows are used to detect co-occurring clusters of tokens with the same or similar strings. To assess the performance of the method, we first tune the model parameters on a 2-month 1% Twitter stream, during which a number of world-wide events of differing type and duration (sports, politics, disasters, health, and celebrities) occurred. Then, we evaluate the quality of all discovered events in a 1-year stream, “googling” the most frequent cluster n-grams and manually assessing how many clusters correspond to news published in the same temporal slot. Finally, we perform a complexity evaluation and compare SAX* with three alternative methods for event discovery. Our evaluation shows that SAX* is at least one order of magnitude less complex than other temporal and non-temporal approaches to micro-blog clustering.
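The discretization step described above can be illustrated with a minimal Python sketch of standard SAX (z-normalization, piecewise aggregation, and equiprobable Gaussian breakpoints), followed by a naive grouping of tokens that share the same string. This is not the authors' SAX* implementation: the function names `sax_string` and `group_by_string`, the two-letter alphabet, the segment count, and the toy counts are all assumptions made for illustration, and the grouping uses exact string matches only, without the filtering of “interesting” patterns or the sliding windows.

```python
from collections import defaultdict
from statistics import NormalDist, mean, pstdev


def sax_string(series, n_segments, alphabet="ab"):
    """Discretize a numeric series into a SAX-style string (illustrative)."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    mu, sigma = mean(series), pstdev(series)
    # z-normalize so that only the shape of the series matters, not its scale.
    # A flat series (sigma == 0) is mapped to all zeros.
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Piecewise aggregation: average each equal-width segment.
    width = len(z) // n_segments
    paa = [mean(z[i:i + width]) for i in range(0, len(z), width)]
    # Breakpoints split the standard normal into equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    # The symbol index is the number of breakpoints at or below the value.
    return "".join(alphabet[sum(v >= b for b in breakpoints)] for v in paa)


def group_by_string(term_series, n_segments, alphabet="ab"):
    """Group terms whose series map to exactly the same string (assumption:
    exact match only; the paper also considers similar strings)."""
    groups = defaultdict(list)
    for term, series in term_series.items():
        groups[sax_string(series, n_segments, alphabet)].append(term)
    return dict(groups)


# Hypothetical per-slot counts for three tokens within one window.
window = {
    "quake": [0, 0, 0, 0, 10, 10, 0, 0],
    "tsunami": [0, 0, 0, 0, 1, 1, 0, 0],   # same shape, smaller scale
    "coffee": [5, 5, 0, 0, 0, 0, 0, 0],    # different shape
}
clusters = group_by_string(window, n_segments=4)
print(clusters)
```

In this toy window, "quake" and "tsunami" fall into the same group because z-normalization removes the difference in volume, while "coffee" peaks in a different segment and is kept apart.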
