iTopicModel: Information Network-Integrated Topic Modeling

Document networks, i.e., networks associated with text information, are becoming increasingly popular due to the ubiquity of Web documents, blogs, and various kinds of online data. In this paper, we propose a novel topic modeling framework for document networks, which builds a unified generative topic model that is able to consider both text and structure information for documents. A graphical model is proposed to describe the generative model. On the top layer of this graphical model, we define a novel multivariate Markov Random Field for topic distribution random variables for each document, to model the dependency relationships among documents over the network structure. On the bottom layer, we follow the traditional topic model to model the generation of text for each document. A joint distribution function for both the text and structure of the documents is thus provided. A solution to estimate this topic model is given, by maximizing the log-likelihood of the joint probability. Some important practical issues in real applications are also discussed, including how to decide the topic number and how to choose a good network structure. We apply the model on two real datasets, DBLP and Cora, and the experiments show that this model is more effective in comparison with the state-of-the-art topic modeling algorithms.

[1]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[2]  P. Pérez,et al.  Markov random fields and images , 1998 .

[3]  Gerhard Weikum,et al.  Graph-based text classification: learn from your neighbors , 2006, SIGIR.

[4]  David M. Blei,et al.  Relational Topic Models for Document Networks , 2009, AISTATS.

[5]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[6]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[7]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[9]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[10]  Thomas L. Griffiths,et al.  Probabilistic author-topic models for information discovery , 2004, KDD.

[11]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[12]  Stefan Siersdorfer,et al.  A neighborhood-based approach for clustering of linked document collections , 2006, CIKM '06.

[13]  Stan Z. Li Markov Random Field Modeling in Image Analysis , 2009, Advances in Pattern Recognition.

[14]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[15]  Deng Cai,et al.  Topic modeling with network regularization , 2008, WWW.