Mining evolutionary multi-branch trees from text streams

Understanding topic hierarchies in text streams and how they evolve over time is important in many applications. In this paper, we propose an evolutionary multi-branch tree clustering method for streaming text data. We build evolutionary trees in a Bayesian online filtering framework. Tree construction is formulated as an online posterior estimation problem that considers both the likelihood of the current tree and the conditional prior given the previous tree. We also introduce a constraint model to compute the conditional prior of a tree in the multi-branch setting. Experiments on real-world news data demonstrate that our algorithm incorporates historical tree information better, and is more efficient and effective, than the traditional evolutionary hierarchical clustering algorithm.
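
As a rough illustration of the posterior estimation described above, the objective can be sketched as follows. The notation is an assumption made here for exposition and is not taken from the paper: $T_t$ denotes the tree at time step $t$, $T_{t-1}$ the previous tree, and $D_t$ the documents arriving at step $t$.

```latex
% Hypothetical notation: T_t = current tree, T_{t-1} = previous tree,
% D_t = documents observed at time step t.
\begin{align}
  p(T_t \mid D_t, T_{t-1})
    &\propto \underbrace{p(D_t \mid T_t)}_{\text{likelihood of the current tree}}
       \; \underbrace{p(T_t \mid T_{t-1})}_{\text{conditional prior given the previous tree}} \\
  \hat{T}_t
    &= \arg\max_{T_t} \; \bigl[ \log p(D_t \mid T_t) + \log p(T_t \mid T_{t-1}) \bigr]
\end{align}
```

Under this reading, the likelihood term rewards trees that fit the current documents, while the conditional prior, which the constraint model is said to compute, favors trees that stay consistent with the previous one. The factorization assumes that $D_t$ depends on $T_{t-1}$ only through $T_t$.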
