TopicSketch: Real-Time Bursty Topic Detection from Twitter

Twitter has become one of the largest microblogging platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research on topic modelling and analysis in Twitter, detecting bursty topics in real time remains a challenge. Because existing methods can hardly scale to handle the tweet stream in real time, we propose in this paper TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experimental results show both the efficiency and the effectiveness of our approach. In particular, they demonstrate that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets in Twitter, and can present bursty events at a finer granularity.
