Streaming-LDA: A Copula-based Approach to Modeling Topic Dependencies in Document Streams

In this paper, we propose two new models for capturing topic and word-topic dependencies between consecutive documents in document streams. The first model is a direct extension of the Latent Dirichlet Allocation (LDA) model and uses a Dirichlet distribution to balance the influence of the LDA prior parameters with respect to the topic and word-topic distributions of the previous document. The second extension uses copulas, which constitute a generic tool for modeling dependencies between random variables. We rely here on Archimedean copulas, and more precisely on Frank copulas, as they are symmetric and associative and are thus appropriate for exchangeable random variables. Our experiments, conducted on three standard collections that have been used in several studies on topic modeling, show that our proposals outperform previous ones (such as dynamic topic models and temporal LDA), both in terms of perplexity and for tracking similar topics in a document stream.
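To illustrate the symmetry and associativity properties mentioned above, the following is a minimal sketch of the standard bivariate Frank copula, C(u, v) = -(1/theta) ln(1 + (e^{-theta u} - 1)(e^{-theta v} - 1) / (e^{-theta} - 1)). The function name `frank_copula` and the parameter values are illustrative assumptions; the sketch does not reproduce how the copula is integrated into the proposed topic models.

```python
import math


def frank_copula(u: float, v: float, theta: float) -> float:
    """Bivariate Frank copula C_theta(u, v) for u, v in [0, 1] and theta != 0.

    theta is the dependence parameter (an illustrative choice here):
    theta > 0 gives positive dependence, theta < 0 negative dependence,
    and theta -> 0 approaches the independence copula u * v.
    """
    if theta == 0:
        raise ValueError("theta must be non-zero; the limit theta -> 0 is u * v")
    # expm1(x) = e^x - 1 and log1p(x) = ln(1 + x) keep precision for small values.
    numerator = math.expm1(-theta * u) * math.expm1(-theta * v)
    denominator = math.expm1(-theta)
    return -math.log1p(numerator / denominator) / theta


theta = 5.0  # hypothetical dependence strength
u, v, w = 0.3, 0.7, 0.5

# Symmetry: swapping the arguments does not change the value.
symmetric_gap = abs(frank_copula(u, v, theta) - frank_copula(v, u, theta))

# Associativity: C(C(u, v), w) equals C(u, C(v, w)).
left = frank_copula(frank_copula(u, v, theta), w, theta)
right = frank_copula(u, frank_copula(v, w, theta), theta)
associative_gap = abs(left - right)

print(symmetric_gap, associative_gap)
```

Both gaps should be zero up to floating-point rounding. Associativity is what allows a bivariate Archimedean copula to be extended to more than two variables without depending on the order in which the variables are combined.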
