Discovering Canonical Correlations between Topical and Topological Information in Document Networks

A document network is an intriguing kind of dataset that provides both topical (textual content) and topological (relational link) information. A key point in modeling such datasets is to discover the common denominators underlying the text and the links. Most previous work assumes that documents closely linked with each other share common latent topics. However, this assumption neglects the heterophily of nodes (i.e., the tendency to link to dissimilar others), which is pervasive in social networks. In this paper, we incorporate community detection and topic modeling simultaneously in a unified framework, and appeal to Canonical Correlation Analysis (CCA) to capture the latent semantic correlations between the two heterogeneous factors, community and topic. Regardless of whether the data exhibit homophily (i.e., the tendency to link to similar others) or heterophily, CCA can capture the inherent correlations that fit the dataset itself, without any prior hypothesis. We also incorporate auxiliary word embeddings to improve the quality of topics. The effectiveness of our proposed model is comprehensively verified on three different types of datasets: hyperlinked networks of web pages, social networks of friends, and coauthor networks of publications. Experimental results show that our approach achieves significant improvements compared with the current state of the art.
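To make the role of CCA concrete, the following is a minimal sketch of classical (linear) CCA between two views of the same documents. It is not the model proposed in the paper, which embeds the correlation inside a joint community and topic framework; the function name `canonical_correlations`, the regularization constant `reg`, and the synthetic "community" and "topic" features are all assumptions made for illustration. The second view is constructed to be negatively related to the first, which shows that canonical correlations measure the strength of association without presupposing its direction.

```python
import numpy as np


def canonical_correlations(X, Y, reg=1e-8):
    """Return the canonical correlations between views X (n x p) and Y (n x q).

    Classical CCA: whiten each view, then take the singular values of the
    whitened cross-covariance. `reg` is a small ridge term (an assumption
    added here for numerical stability).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    # Singular values are the canonical correlations, in descending order.
    return np.linalg.svd(M, compute_uv=False)


# Hypothetical data: a shared latent factor z drives one "community" feature
# and, with the opposite sign, one "topic" feature; the other columns are noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
community = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
                       rng.normal(size=(500, 1))])
topic = np.hstack([-z + 0.1 * rng.normal(size=(500, 1)),
                   rng.normal(size=(500, 1))])

rho = canonical_correlations(community, topic)
print(rho)  # first value close to 1, second value close to 0
```

In this sketch the leading canonical correlation is high even though the two related features move in opposite directions, which is the property the abstract relies on when it says the correlations are learned from the data rather than assumed in advance.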
