Document-topic hierarchies from document graphs

Topic taxonomies present a multi-level view of a document collection, where general topics live towards the top of the taxonomy and more specific topics live towards the bottom. Topic taxonomies allow users to quickly drill down into their topic of interest to find documents. We show that hierarchies of documents, where documents live at the inner nodes of the hierarchy tree, can also be inferred by combining document text with inter-document links. We present a Bayesian generative model by which an explicit hierarchy of documents is created. Experiments on three document-graph data sets show that the generated document hierarchies fit the observed data, and that the levels in the constructed document hierarchy represent practical groupings.
