Tracking and Connecting Topics via Incremental Hierarchical Dirichlet Processes

Much research has been devoted to topic detection from text, but one major challenge has not been addressed: revealing the rich relationships that exist among the detected topics. Finding such relationships is important because many applications need to know how topics come into being, how they develop, grow, and disintegrate, and how they finally disappear. In this paper, we present a novel method that reveals the connections between topics discovered from text data. Specifically, our method focuses on how one topic splits into multiple topics and how multiple topics merge into one. We adopt the hierarchical Dirichlet process (HDP) model and propose an incremental Gibbs sampling algorithm to derive and refine the cluster labels incrementally. We then characterize the splitting and merging patterns among clusters based on how these labels change. We propose a global analysis process that focuses on cluster splitting and merging, and a finer-granularity analysis process that helps users better understand the content of the clusters and their evolution patterns. We also develop a visualization process to present the results.
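
To make the idea of characterizing splits and merges from label changes concrete, the following is a minimal sketch and not the procedure of this paper. It assumes that each document carries a cluster label before and after one refinement step, and it flags a split when an old cluster sends a sizable share of its documents to more than one new cluster, and a merge when a new cluster draws a sizable share of its documents from more than one old cluster. The function name `detect_splits_and_merges`, the `min_share` threshold, and the toy labels are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def detect_splits_and_merges(old_labels, new_labels, min_share=0.2):
    """Flag cluster splits and merges between two labelings of the same documents.

    old_labels, new_labels: sequences of cluster labels, one per document,
    in the same document order. min_share is an assumed threshold: a flow
    between two clusters counts only if it covers at least this fraction
    of the old cluster (for splits) or of the new cluster (for merges).
    """
    if len(old_labels) != len(new_labels):
        raise ValueError("both labelings must cover the same documents")

    # Number of documents moving from each old cluster to each new cluster.
    flows = Counter(zip(old_labels, new_labels))
    old_sizes = Counter(old_labels)
    new_sizes = Counter(new_labels)

    outgoing = defaultdict(list)  # old cluster -> significant new clusters
    incoming = defaultdict(list)  # new cluster -> significant old clusters
    for (old, new), count in flows.items():
        if count / old_sizes[old] >= min_share:
            outgoing[old].append(new)
        if count / new_sizes[new] >= min_share:
            incoming[new].append(old)

    # A split has several significant targets; a merge has several sources.
    splits = {o: sorted(t) for o, t in outgoing.items() if len(t) > 1}
    merges = {n: sorted(s) for n, s in incoming.items() if len(s) > 1}
    return splits, merges


if __name__ == "__main__":
    # Toy labels (made up): cluster A splits into X and Y; B and C merge into Z.
    old = ["A", "A", "A", "A", "B", "B", "C", "C"]
    new = ["X", "X", "Y", "Y", "Z", "Z", "Z", "Z"]
    splits, merges = detect_splits_and_merges(old, new)
    print("splits:", splits)
    print("merges:", merges)
```

The threshold keeps a handful of reassigned documents from being reported as a split or merge; an actual analysis would need to choose it, or replace it with a different criterion, based on the data.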
