Discovering Canonical Correlations between Topical and Topological Information in Document Networks

A document network is an intriguing kind of dataset that provides both topical (textual content) and topological (relational link) information. A key point in modeling such datasets is to discover the common denominators underlying the text and the links. Most previous work assumes that documents closely linked with each other share common latent topics. However, this assumption neglects the heterophily of nodes (i.e., the tendency to link to dissimilar others), which is pervasive in social networks. In this paper, we incorporate community detection and topic modeling in a unified framework, and appeal to Canonical Correlation Analysis (CCA) to capture the latent semantic correlations between the two heterogeneous factors, community and topic. Regardless of whether a network exhibits homophily (i.e., the tendency to link to similar others) or heterophily, CCA can capture the inherent correlations that fit the dataset itself, without any prior hypothesis. We also incorporate auxiliary word embeddings to improve the quality of topics. The effectiveness of our proposed model is comprehensively verified on three different types of datasets: hyperlinked networks of web pages, social networks of friends, and coauthor networks of publications. Experimental results show that our approach achieves significant improvements over the current state of the art.
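To make the role of CCA concrete, the following is a minimal sketch of classical CCA applied to two per-document views: community memberships and topic proportions. It is not the unified model proposed in the paper, which learns communities, topics, and their correlations jointly. The function name `canonical_correlations`, the ridge term `reg`, the mixing matrix, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-4):
    """Classical CCA via whitening and SVD (illustrative sketch).

    X: (n_docs, n_communities) community memberships (hypothetical input).
    Y: (n_docs, n_topics) topic proportions (hypothetical input).
    reg: small ridge term; rows that sum to 1 make the covariances singular.
    Returns canonical correlations and the projection weights for X and Y.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten both views with Cholesky factors: M = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    # Singular values of M are the (regularized) canonical correlations.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)      # weights for the community view
    B = np.linalg.solve(Ly.T, Vt.T)   # weights for the topic view
    return s, A, B

# Synthetic data: topics depend on communities through an assumed mixing matrix.
rng = np.random.default_rng(0)
communities = rng.dirichlet(np.ones(3), size=200)
mixing = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.7, 0.1, 0.1],
                   [0.1, 0.1, 0.4, 0.4]])
noise = rng.dirichlet(np.ones(4), size=200)
topics = 0.8 * communities @ mixing + 0.2 * noise

corrs, A, B = canonical_correlations(communities, topics)
print("canonical correlations:", np.round(corrs, 3))
```

Because CCA only looks for maximally correlated directions in the two views, it does not presuppose that linked documents share topics: a community that is strongly associated with several distinct topics yields a high canonical correlation just as a one-to-one community-topic pairing would.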

