Latent Topic Models for Hypertext

Latent topic models have been successfully applied as an unsupervised topic discovery technique in large document collections. With the proliferation of hypertext document collections such as the Internet, there has also been great interest in extending these approaches to hypertext [6, 9]. These approaches typically model links in a fashion analogous to how they model words: the document-link co-occurrence matrix is modeled in the same way that the document-word co-occurrence matrix is modeled in standard topic models. In this paper we present a probabilistic generative model for hypertext document collections that explicitly models the generation of links. Specifically, links from a word w to a document d depend directly on how frequent the topic of w is in d, in addition to the in-degree of d. We show how to perform EM learning on this model efficiently. By not modeling links as analogous to words, we end up using far fewer free parameters and obtain better link prediction results.
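The link-generation idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes, purely for illustration, that the score of a link from a word with topic z to a document d is the product of a per-document weight (standing in for the in-degree effect) and the proportion of topic z in d. The names `link_scores`, `rank_targets`, `doc_topic_mix`, and `in_degree_weight` are hypothetical, and the numbers are toy values, not learned parameters.

```python
def link_scores(word_topic, doc_topic_mix, in_degree_weight):
    """Score every candidate target document for a link from one word.

    Assumed form (illustrative only):
        score(d) = in_degree_weight[d] * doc_topic_mix[d][word_topic]
    """
    return [
        weight * mix[word_topic]
        for mix, weight in zip(doc_topic_mix, in_degree_weight)
    ]


def rank_targets(word_topic, doc_topic_mix, in_degree_weight):
    """Return document indices sorted from most to least likely link target."""
    scores = link_scores(word_topic, doc_topic_mix, in_degree_weight)
    return sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)


# Toy collection: 3 documents, 2 topics (hypothetical values).
doc_topic_mix = [
    [0.9, 0.1],  # document 0 is mostly about topic 0
    [0.2, 0.8],  # document 1 is mostly about topic 1
    [0.5, 0.5],  # document 2 is evenly mixed
]
# One weight per document, playing the role of the in-degree effect.
in_degree_weight = [0.2, 0.2, 0.3]

ranking_topic0 = rank_targets(0, doc_topic_mix, in_degree_weight)
ranking_topic1 = rank_targets(1, doc_topic_mix, in_degree_weight)
```

Under these assumptions, a word about topic 0 prefers document 0 and a word about topic 1 prefers document 1, while the frequently linked document 2 ranks second in both cases. The sketch needs only one link-specific weight per document, whereas treating link targets like vocabulary words would require a separate distribution over targets for each topic, which is one way to read the abstract's claim about using fewer free parameters.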

[1]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci.

[2]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[3]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[4]  Rajeev Motwani,et al.  The PageRank Citation Ranking: Bringing Order to the Web , 1999, WWW 1999.

[5]  David Cohn,et al.  Learning to Probabilistically Identify Authoritative Documents , 2000, ICML.

[6]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res.

[7]  J. Lafferty,et al.  Mixed-membership models of scientific publications , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Michal Rosen-Zvi,et al.  Hidden Topic Markov Models , 2007, AISTATS.

[9]  John D. Lafferty,et al.  Dynamic topic models , 2006, ICML.

[10]  Aleks Jakulin,et al.  Applying Discrete PCA in Data Analysis , 2004, UAI.

[11]  John D. Lafferty,et al.  Correlated Topic Models , 2005, NIPS.

[12]  Ramesh Nallapati,et al.  Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence of Blogs , 2008, ICWSM.

[13]  Thomas L. Griffiths,et al.  Integrating Topics and Syntax , 2004, NIPS.

[14]  Tom Minka,et al.  Expectation-Propagation for the Generative Aspect Model , 2002, UAI.

[15]  David A. Cohn,et al.  The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity , 2000, NIPS.

[16]  Steffen Bickel,et al.  Unsupervised prediction of citation influences , 2007, ICML '07.

[17]  Hanna M. Wallach,et al.  Topic modeling: beyond bag-of-words , 2006, ICML.