Towards Interactive Construction of Topical Hierarchy: A Recursive Tensor Decomposition Approach

Automatic construction of user-desired topical hierarchies over large volumes of text data is a highly desirable but challenging task. This study proposes to give users the freedom to construct topical hierarchies via interactive operations such as expanding a branch and merging several branches. Existing hierarchical topic modeling techniques are inadequate for this purpose because (1) they cannot consistently preserve the topics when the hierarchy structure is modified; and (2) their slow inference prevents a swift response to user requests. In this study, we propose a novel method, called STROD, that allows efficient and consistent modification of topic hierarchies, based on a recursive generative model and a scalable tensor decomposition inference algorithm with theoretical performance guarantees. Empirical evaluation shows that STROD reduces the runtime of construction by several orders of magnitude, while generating consistent and high-quality hierarchies.
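To make the inference idea concrete, the sketch below illustrates the general tensor power method for learning latent variable models (in the style of Anandkumar et al.), on which this family of approaches builds: the components of a whitened, symmetric third-order moment tensor are recovered one at a time by power iteration and deflation. This is not the authors' STROD implementation; the moment construction from word co-occurrences, the whitening step, and the recursion over the hierarchy are omitted, and all function names (`tensor_apply`, `power_iteration`, `decompose`) are illustrative assumptions.

```python
# Minimal sketch of tensor power iteration with deflation, assuming
# `T` is an already-whitened symmetric k x k x k moment tensor.
import numpy as np

def tensor_apply(T, v):
    """Compute T(I, v, v): contract the last two modes of T with v."""
    return np.einsum('ijk,j,k->i', T, v, v)

def power_iteration(T, n_starts=10, n_iters=100, seed=None):
    """Run power iteration from several random starts; keep the best pair."""
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    best_lam, best_v = -np.inf, None
    for _ in range(n_starts):
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            v = tensor_apply(T, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue estimate
        if lam > best_lam:
            best_lam, best_v = lam, v
    return best_lam, best_v

def decompose(T, k):
    """Recover k (eigenvalue, eigenvector) pairs by repeated deflation."""
    pairs, T = [], T.copy()
    for _ in range(k):
        lam, v = power_iteration(T)
        pairs.append((lam, v))
        # Deflate: subtract the recovered rank-1 component.
        T -= lam * np.einsum('i,j,k->ijk', v, v, v)
    return pairs
```

In a hierarchical setting such as the one described in the abstract, a decomposition of this kind would plausibly be applied recursively: the topics recovered at one node are used to partition (or reweight) the data, and the same procedure is then run within each child branch, which is what makes operations like expanding a single branch cheap relative to re-fitting the whole hierarchy.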
