Topic modeling with network regularization

In this paper, we formally define the problem of topic modeling with network structure (TMN). We propose a novel solution to this problem, which regularizes a statistical topic model with a harmonic regularizer based on a graph structure in the data. The proposed method bridges topic modeling and social network analysis, which leverages the power of both statistical topic models and discrete regularization. The output of this model well summarizes topics in text, maps a topic on the network, and discovers topical communities. With concrete selection of a topic model and a graph-based regularizer, our model can be applied to text mining problems such as author-topic analysis, community discovery, and spatial text mining. Empirical experiments on two different genres of data show that our approach is effective, which improves text-oriented methods as well as network-oriented methods. The proposed model is general; it can be applied to any text collections with a mixture of topics and an associated network structure.

[1]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[2]  Martine D. F. Schlag,et al.  Spectral K-Way Ratio-Cut Partitioning and Clustering , 1993, 30th ACM/IEEE Design Automation Conference.

[3]  G. McLachlan,et al.  The EM algorithm and extensions , 1996 .

[4]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Geoffrey E. Hinton,et al.  A View of the Em Algorithm that Justifies Incremental, Sparse, and other Variants , 1998, Learning in Graphical Models.

[6]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[7]  Ravi Kumar,et al.  Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.

[8]  David A. Cohn,et al.  The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity , 2000, NIPS.

[9]  Jon M. Kleinberg,et al.  The small-world phenomenon: an algorithmic perspective , 2000, STOC '00.

[10]  Matthew Richardson,et al.  The Intelligent surfer: Probabilistic Combination of Link and Content Information in PageRank , 2001, NIPS.

[11]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[12]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[13]  Mikhail Belkin,et al.  Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , 2003, Neural Computation.

[14]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[15]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[16]  Ramanathan V. Guha,et al.  Information diffusion through blogspace , 2004, WWW '04.

[17]  Bei Yu,et al.  A cross-collection mixture model for comparative text mining , 2004, KDD.

[18]  Thomas L. Griffiths,et al.  Probabilistic author-topic models for information discovery , 2004, KDD.

[19]  Andrew McCallum,et al.  Topic and Role Discovery in Social Networks , 2005, IJCAI.

[20]  Christos Faloutsos,et al.  Graphs over time: densification laws, shrinking diameters and possible explanations , 2005, KDD '05.

[21]  Xiaojin Zhu,et al.  Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning , 2005, ICML.

[22]  Luo Si,et al.  Adjusting Mixture Weights of Gaussian Mixture Model via Regularized Probabilistic Latent Semantic Analysis , 2005, PAKDD.

[23]  Ran El-Yaniv,et al.  Multi-way distributional clustering via pairwise interactions , 2005, ICML.

[24]  Jon M. Kleinberg,et al.  Group formation in large social networks: membership, growth, and evolution , 2006, KDD '06.

[25]  Discrete Regularization , 2006, Semi-Supervised Learning.

[26]  ChengXiang Zhai,et al.  A mixture model for contextual text mining , 2006, KDD '06.

[27]  Xiang Ji,et al.  Topic evolution and social interactions: how authors effect research , 2006, CIKM '06.

[28]  Philip S. Yu,et al.  Spectral clustering for multi-type relational data , 2006, ICML.

[29]  Hongyuan Zha,et al.  Probabilistic models for discovering e-communities , 2006, WWW '06.

[30]  Andrew McCallum,et al.  Topics over time: a non-Markov continuous-time model of topical trends , 2006, KDD '06.

[31]  Chao Liu,et al.  A probabilistic approach to spatiotemporal theme pattern mining on weblogs , 2006, WWW '06.

[32]  Wei Li,et al.  Pachinko allocation: DAG-structured mixture models of topic correlations , 2006, ICML.

[33]  Christos Faloutsos,et al.  Cascading Behavior in Large Blog Graphs , 2007 .