Modeling Document Networks with Tree-Averaged Copula Regularization

A document network is a dataset that provides both topical (text) and topological (link) information. Most previous work assumes that closely linked documents share common topics. However, the associations among documents are usually complex and are not limited to homophily (i.e., the tendency to link to similar others). Heterophily (i.e., the tendency to link to different others) is another pervasive phenomenon in social networks. In this paper, we introduce copulas as a tool for modeling documents and links separately, so that different copula functions can be applied to capture different correlation patterns. In statistics, a copula is a framework for explicitly modeling the dependence among random variables by separating their marginal distributions from their correlation structure. Though widely used in economics, copulas have received little attention in the machine learning community. In addition, to capture potential associations among unconnected documents, we apply a tree-averaged copula instead of a single copula function. This change gives our model greater expressive power and a more elegant algebraic form. We derive efficient EM algorithms to estimate the model parameters and evaluate our model on three different datasets. Experimental results show that our approach achieves significant improvements in both topic and link modeling compared with the current state of the art.
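To make the separation of marginals and dependence concrete, the following is a minimal sketch of a bivariate Gaussian copula, written with the Python standard library only. It is a generic illustration and not the model proposed in this paper: the function names, the exponential marginals, and the correlation values are all assumptions chosen for the example. The same marginals are coupled once with a positive and once with a negative copula parameter, loosely mirroring how different copula choices can express homophily-like or heterophily-like dependence.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def sample_gaussian_copula(rho, n, rng):
    """Draw n pairs (u1, u2) with uniform marginals whose dependence
    is given by a Gaussian copula with correlation parameter rho."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal correlated with z1 by rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each coordinate to a uniform on (0, 1).
        pairs.append((STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)))
    return pairs


def exponential_quantile(u, rate):
    """Inverse CDF of an exponential distribution (illustrative marginal)."""
    u = min(u, 1.0 - 1e-12)  # guard against log(0)
    return -math.log(1.0 - u) / rate


def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate(rho, n=5000, seed=0):
    """Couple two fixed exponential marginals through a Gaussian copula."""
    rng = random.Random(seed)
    pairs = sample_gaussian_copula(rho, n, rng)
    xs = [exponential_quantile(u1, 1.0) for u1, _ in pairs]  # rate 1 marginal
    ys = [exponential_quantile(u2, 2.0) for _, u2 in pairs]  # rate 2 marginal
    return xs, ys


if __name__ == "__main__":
    for rho in (0.8, -0.8):
        xs, ys = simulate(rho)
        print(f"rho={rho:+.1f}  mean(x)={sum(xs) / len(xs):.3f}  "
              f"mean(y)={sum(ys) / len(ys):.3f}  corr={pearson(xs, ys):.3f}")
```

In this sketch, changing rho flips the sign of the dependence between the two variables while their marginal means stay governed by the exponential rates, which is the property that lets the marginals and the correlation structure be specified independently.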
