TopicFlow Model: Unsupervised Learning of Topic-specific Influences of Hyperlinked Documents

Popular algorithms for modeling the influence of entities in networked data, such as PageRank, work by analyzing the hyperlink structure but ignore the contents of documents. However, influence is often topic dependent: for example, a web page of high influence in politics may be an unknown entity in sports. We design a new model called TopicFlow, which combines ideas from network flow and topic modeling, to learn this notion of topic-specific influences of hyperlinked documents in a completely unsupervised fashion. On the task of citation recommendation, which is an instance of capturing influence, the TopicFlow model, when combined with TF-IDF based cosine similarity, outperforms several competitive baselines by as much as 11.8%. Our empirical study of the model’s output on the ACL corpus demonstrates its ability to identify topically influential documents. The TopicFlow model is also competitive with the state-of-the-art Relational Topic Models in predicting the likelihood of unseen text on two different data sets. Due to its ability to learn topic-specific flows across each hyperlink, the TopicFlow model can be a powerful visualization tool to track the diffusion of topics across a citation network.