Detecting Topic Drift with Compound Topic Models

The Latent Dirichlet Allocation (LDA) topic model of Blei, Ng, & Jordan (2003) is well established as an effective approach to recovering meaningful topics of conversation from a set of documents. However, a useful analysis of user-generated content is concerned not only with the recovery of topics from a static data set, but also with the evolution of topics over time. We employ a compound topic model (CTM) to track topics across two distinct data sets (i.e., past and present) and to visualize trends in topics over time; we evaluate several metrics for detecting a change in the distribution of topics within a time window; and we illustrate how our approach discovers emerging conversation topics related to current events in real data sets.
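The abstract does not name the change-detection metrics that were evaluated. As an illustration only, the sketch below uses the Jensen-Shannon divergence, one common way to compare two topic distributions, to score how far the topic mix in a present window has moved from a past window. The function names and the per-topic counts are hypothetical and are not taken from the paper.

```python
import math


def normalize(counts):
    """Turn per-topic counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("topic counts must sum to a positive value")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-assignment counts for four topics in two time windows.
past_counts = [40, 30, 20, 10]
present_counts = [10, 20, 30, 40]

drift = js_divergence(normalize(past_counts), normalize(present_counts))
no_drift = js_divergence(normalize(past_counts), normalize(past_counts))
print(f"drift score: {drift:.4f}, identical windows: {no_drift:.4f}")
```

A score near 0 indicates that the two windows share nearly the same topic mix, while a score near 1 indicates almost no overlap. Any threshold for flagging drift would have to be tuned on the data at hand.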

[1] Ramesh Nallapati, et al. Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence of Blogs, 2008, ICWSM.

[2] Ivan Titov, et al. Modeling online reviews with multi-grain topic models, 2008, WWW.

[3] Regina Barzilay, et al. Learning Document-Level Semantic Properties from Free-Text Annotations, 2008, ACL.

[4] Andrew McCallum, et al. Topics over time: a non-Markov continuous-time model of topical trends, 2006, KDD '06.

[5] James Allan, et al. Topic detection and tracking: event-based information organization, 2002.

[6] Mark Steyvers, et al. Finding scientific topics, 2004, Proceedings of the National Academy of Sciences of the United States of America.

[7] Matthew Hurst, et al. BlogPulse: Automated Trend Discovery for Weblogs, 2003.

[8] Jon M. Kleinberg, et al. Bursty and Hierarchical Structure in Streams, 2002, Data Mining and Knowledge Discovery.

[9] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2003, J. Mach. Learn. Res.

[10] George G. Robertson, et al. Narratives: A visualization to track narrative events as they develop, 2008, 2008 IEEE Symposium on Visual Analytics Science and Technology.

[11] Jeonghee Yi, et al. Detecting buzz from time-sequenced document streams, 2005, 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service.

[12] Ivan Titov, et al. A Joint Model of Text and Aspect Ratings for Sentiment Summarization, 2008, ACL.

[13] Franco Salvetti, et al. Efficient spam analysis for weblogs through URL segmentation, 2007.

[14] Regina Barzilay, et al. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization, 2004, NAACL.