Probabilistic topic models with biased propagation on heterogeneous information networks

With the development of Web applications, textual documents are not only becoming richer but are also ubiquitously interconnected with users and other objects in various ways, giving rise to text-rich heterogeneous information networks. Topic models have proven useful for document analysis, and the interactions among multi-typed objects play a key role in disclosing the rich semantics of the network. However, most topic models consider only the textual information and ignore the network structure, or can integrate only with homogeneous networks; none of them handles heterogeneous information networks well. In this paper, we propose a novel topic model with biased propagation (TMBP) algorithm that directly incorporates the heterogeneous information network into topic modeling in a unified way. The underlying intuition is that multi-typed objects should be treated differently according to their inherent textual information and the rich semantics of the heterogeneous information network; a simple, unbiased topic propagation across such a network does not make much sense. Consequently, we investigate and develop two biased propagation frameworks for the TMBP algorithm from different perspectives, the biased random walk framework and the biased regularization framework, which discover latent topics and identify clusters of multi-typed objects simultaneously. We extensively evaluate the proposed approach and compare it with state-of-the-art techniques on several datasets. Experimental results demonstrate that the improvement achieved by our approach is consistent and promising.
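
To make the idea of biased (as opposed to uniform) propagation concrete, the following is a minimal sketch of one propagation step, not the TMBP algorithm itself. It assumes that documents already have text-based topic distributions, that objects without text (for example, authors) simply inherit the average of their linked documents' topics, and that a document mixes its own distribution with its neighbors' using a weight `xi`. The function name `biased_propagation_step`, the parameter `xi`, and the data layout are all hypothetical and chosen for illustration only.

```python
def biased_propagation_step(doc_topics, doc_links, xi=0.7):
    """One illustrative biased propagation step (hypothetical sketch).

    doc_topics: dict mapping document id -> list of topic probabilities
                estimated from the document's own text.
    doc_links:  dict mapping document id -> list of linked object ids
                (e.g. authors) that carry no text of their own.
    xi:         assumed bias weight in [0, 1]; higher values trust the
                document's own text more than its neighbors.
    Returns (updated document topics, topics of text-less objects).
    """
    num_topics = len(next(iter(doc_topics.values())))

    # Invert the links: which documents is each object attached to?
    obj_docs = {}
    for doc, objs in doc_links.items():
        for obj in objs:
            obj_docs.setdefault(obj, []).append(doc)

    # Text-less objects take the mean topic distribution of their documents.
    obj_topics = {
        obj: [sum(doc_topics[d][z] for d in docs) / len(docs)
              for z in range(num_topics)]
        for obj, docs in obj_docs.items()
    }

    # Documents keep a xi share of their own text-based topics and take
    # the remaining (1 - xi) share from the objects linked to them.
    new_doc_topics = {}
    for doc, theta in doc_topics.items():
        objs = doc_links.get(doc, [])
        if not objs:
            new_doc_topics[doc] = list(theta)
            continue
        neighbor = [sum(obj_topics[o][z] for o in objs) / len(objs)
                    for z in range(num_topics)]
        new_doc_topics[doc] = [xi * theta[z] + (1 - xi) * neighbor[z]
                               for z in range(num_topics)]
    return new_doc_topics, obj_topics
```

Because every update is a convex combination of probability vectors, each result remains a valid distribution. Setting `xi` to 1 ignores the network entirely, while lower values let linked objects pull a document's topics toward those of its neighbors.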
