Hierarchical relational models for document networks

We develop the relational topic model (RTM), a hierarchical model of both network structure and node attributes. We focus on document networks, where the attributes of each document are its words, that is, discrete observations taken from a fixed vocabulary. For each pair of documents, the RTM models their link as a binary random variable that is conditioned on their contents. The model can be used to summarize a network of documents, predict links between them, and predict words within them. We derive efficient inference and estimation algorithms based on variational methods that take advantage of sparsity and scale with the number of links. We evaluate the predictive performance of the RTM for large networks of scientific abstracts, web documents, and geographically tagged news.
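As a rough illustration of a link modeled as a binary random variable conditioned on document contents, the sketch below computes a link probability from the topic proportions of two documents. The abstract does not state the form of the link function, so the logistic function applied to the element-wise product of topic proportions is an assumption here, as are the names `topic_proportions`, `link_probability`, `eta`, and `nu` and all toy values.

```python
import math


def topic_proportions(topic_assignments, num_topics):
    """Empirical topic proportions of one document.

    topic_assignments holds one topic index per word (hypothetical input;
    in a fitted model these would come from inference).
    """
    counts = [0] * num_topics
    for z in topic_assignments:
        counts[z] += 1
    total = len(topic_assignments)
    return [c / total for c in counts]


def link_probability(zbar_a, zbar_b, eta, nu):
    """Probability that two documents are linked.

    Assumed form: logistic function of a weighted element-wise product
    of the two documents' topic proportions, plus an intercept nu.
    """
    score = sum(e * a * b for e, a, b in zip(eta, zbar_a, zbar_b)) + nu
    return 1.0 / (1.0 + math.exp(-score))


# Toy documents: a and b share a dominant topic, a and c do not.
doc_a = topic_proportions([0, 0, 1, 0], num_topics=3)
doc_b = topic_proportions([0, 0, 0, 2], num_topics=3)
doc_c = topic_proportions([2, 2, 1, 2], num_topics=3)

eta = [4.0, 4.0, 4.0]  # illustrative per-topic weights
nu = -1.0              # illustrative intercept

p_ab = link_probability(doc_a, doc_b, eta, nu)
p_ac = link_probability(doc_a, doc_c, eta, nu)
print(f"p(link a-b) = {p_ab:.3f}, p(link a-c) = {p_ac:.3f}")
```

Under these assumptions, documents whose words draw on the same topics receive a higher link probability than documents with little topical overlap, and the function is symmetric in its two document arguments.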
