Online Topic Evolution Modeling Based on Hierarchical Dirichlet Process

This paper presents a model based on the Hierarchical Dirichlet Process (HDP) that automatically captures evolutionary thematic patterns in text. Our approach lets HDP work in an online fashion: given the old model, it builds an up-to-date model for new documents without accessing historical data. Since exact posterior inference is intractable, we use Gibbs sampling for approximate inference. Once the topics are found, we analyze the evolution relationships between time-adjacent topics. Experiments on a real-world dataset (Reuters-21578) validate the model quantitatively and show its advantage over both OLDA and plain HDP in modeling topic evolution.
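The abstract does not say how evolution relationships between time-adjacent topics are measured. As a rough illustration only, the sketch below links each topic in the current time slice to its closest topic in the previous slice using Jensen-Shannon divergence between topic-word distributions. The divergence measure, the `threshold` value, the function names, and the toy distributions are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    # Smooth and renormalize so that zero entries do not produce log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl_pm = float(np.sum(p * np.log(p / m)))
    kl_qm = float(np.sum(q * np.log(q / m)))
    return 0.5 * kl_pm + 0.5 * kl_qm

def link_topics(prev_topics, curr_topics, threshold=0.3):
    """For each current topic, return the index of the closest previous topic,
    or None when every previous topic is farther than `threshold`
    (treated here, by assumption, as a newly emerged topic)."""
    links = []
    for curr in curr_topics:
        divs = [js_divergence(curr, prev) for prev in prev_topics]
        best = int(np.argmin(divs))
        links.append(best if divs[best] <= threshold else None)
    return links

# Hypothetical topic-word distributions over a 6-word vocabulary.
prev_topics = np.array([
    [0.5, 0.4, 0.025, 0.025, 0.025, 0.025],
    [0.025, 0.025, 0.5, 0.4, 0.025, 0.025],
])
curr_topics = np.array([
    [0.45, 0.45, 0.025, 0.025, 0.025, 0.025],  # close to previous topic 0
    [0.025, 0.025, 0.025, 0.025, 0.5, 0.4],    # mass on words no old topic covers
    [0.025, 0.025, 0.4, 0.5, 0.025, 0.025],    # close to previous topic 1
])

links = link_topics(prev_topics, curr_topics)
print(links)  # [0, None, 1]
```

Because HDP does not fix the number of topics in advance, a current topic may have no close predecessor; returning `None` in that case is one simple way to flag a candidate new topic.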
