GSimRank: A General Similarity Measure on Heterogeneous Information Network

Measuring the similarity of objects in an information network is a fundamental problem that has attracted extensive study because of its wide range of applications, such as recommendation and information retrieval. With the advent of large-scale heterogeneous information networks (HINs), which consist of multiple types of relationships, it is important to study similarity measures for such networks. However, most existing similarity measures are defined for homogeneous networks and cannot be applied directly to HINs, because the different semantic meanings behind edges must be taken into account. This paper proposes GSimRank, an extension of the well-known SimRank, to compute similarity on HINs. Rather than summing over all meeting paths of two nodes as SimRank does, GSimRank selects linked nodes of the same semantic category as the next step in the pairwise random walk, which ensures that the two meeting paths share the same semantics. Further, to weight the semantic edges, we propose a domain-independent edge weight evaluation method based on entropy theory. Finally, we prove that GSimRank is still based on the expected meeting distance model and report experiments on two real-world datasets that demonstrate the performance of GSimRank.
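
To make the idea of a semantically constrained pairwise walk concrete, the following is a minimal illustrative sketch, not the GSimRank definition itself. It assumes a SimRank-style iteration in which two nodes may only step back along in-edges of the same type, with uniform weights; the entropy-based edge weighting is omitted because the abstract does not specify it. The function name `typed_simrank`, the `in_edges` layout, the decay value 0.8, and the toy graph are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import product


def typed_simrank(in_edges, decay=0.8, iterations=10):
    """SimRank-style similarity restricted to same-typed in-edges.

    in_edges maps node -> list of (in_neighbor, edge_type).
    Assumption: uniform weights; the paper's entropy-based edge
    weights are NOT modeled here.
    """
    # Collect every node that appears as a target or a source.
    nodes = set(in_edges)
    for edges in in_edges.values():
        nodes.update(src for src, _ in edges)

    # Group each node's in-neighbors by edge type.
    by_type = {n: defaultdict(list) for n in nodes}
    for n, edges in in_edges.items():
        for src, etype in edges:
            by_type[n][etype].append(src)

    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
                continue
            total, pairs = 0.0, 0
            # Only edge types present at BOTH nodes contribute, so the
            # two walkers always move along edges of the same type.
            for etype in by_type[a].keys() & by_type[b].keys():
                for i, j in product(by_type[a][etype], by_type[b][etype]):
                    total += sim[(i, j)]
                    pairs += 1
            new[(a, b)] = decay * total / pairs if pairs else 0.0
        sim = new
    return sim


# Hypothetical toy HIN: an organization, an author, a venue, three papers.
example_in_edges = {
    "a1": [("org1", "affiliated")],
    "v1": [("org1", "affiliated")],
    "p1": [("a1", "writes")],
    "p2": [("a1", "writes"), ("v1", "publishes")],
    "p3": [("v1", "publishes")],
}
example_sim = typed_simrank(example_in_edges)
```

In this toy graph, `p1` and `p2` share an in-neighbor through the same edge type (`writes`), so their score is 0.8. `p1` and `p3` are reached only through edges of different types (`writes` versus `publishes`), so their score is 0.0, whereas a type-blind SimRank would give them a positive score through the similarity of `a1` and `v1`.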