Modeling topic hierarchies with the recursive Chinese restaurant process

Topic models such as latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP) are simple and popular tools for discovering topics in a collection of unannotated documents. A major shortcoming of both, however, is that they do not organize the topics into the hierarchical structure naturally found in many datasets. We introduce the recursive Chinese restaurant process (rCRP) and a nonparametric topic model that uses rCRP as a prior for discovering a topic hierarchy of unbounded depth and width. Unlike previous models for discovering topic hierarchies, rCRP allows each document to be generated from a mixture over the entire set of topics in the hierarchy. We apply rCRP to a corpus of New York Times articles, a dataset of MovieLens ratings, and a set of Wikipedia articles, and show the discovered topic hierarchies. We compare the predictive power of rCRP with LDA, HDP, and the nested Chinese restaurant process (nCRP) using held-out likelihood, and show that rCRP outperforms the others. We also suggest two metrics that quantify the characteristics of a topic hierarchy in order to compare the hierarchies discovered by rCRP and nCRP. The results show that rCRP discovers a hierarchy in which topics become more specialized toward the leaves, and topics within an immediate family exhibit more affinity than topics beyond it.
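To make the recursion concrete, below is a minimal sketch of the kind of CRP-style recursive draw the abstract describes: a single topic draw walks down the tree, and at each node it either stops there (so topics at every depth can be mixed into a document), follows an existing child in proportion to its popularity, or opens a new child, giving unbounded depth and width. The node representation, the `gamma` concentration parameter, and the function names here are illustrative assumptions, not the paper's notation or inference procedure.

```python
import random

def new_node():
    # "stop" starts at 1 as a pseudo-count so a draw can end at a fresh node
    return {"stop": 1, "children": []}

def rcrp_draw(node, gamma=1.0, depth=0):
    """Sample one topic node from a sketch of a recursive CRP.

    At each node, the draw stops with weight node["stop"], descends into an
    existing child with weight equal to that child's count, or opens a new
    child with weight gamma. Counts are updated in place; returns the depth
    of the sampled topic.
    """
    weights = [node["stop"]] + [c["n"] for c in node["children"]] + [gamma]
    r = random.uniform(0, sum(weights))
    if r < weights[0]:                       # stop: this node is the topic
        node["stop"] += 1
        return depth
    r -= weights[0]
    for child in node["children"]:           # descend into an existing child
        if r < child["n"]:
            child["n"] += 1
            return rcrp_draw(child["node"], gamma, depth + 1)
        r -= child["n"]
    # open a new child (unbounded width), then recurse (unbounded depth)
    child = {"n": 1, "node": new_node()}
    node["children"].append(child)
    return rcrp_draw(child["node"], gamma, depth + 1)
```

Because a draw can stop at any level, repeated draws for one document yield topics scattered across the whole tree, which is the property that distinguishes rCRP from nCRP's single root-to-leaf path per document.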
