HighSim: Highly Effective Similarity Measurement in Large Heterogeneous Information Networks

Heterogeneous information networks contain rich information in the form of many typed links and typed objects. Finding useful knowledge in large information networks has attracted the attention of many researchers. Well-known ranking algorithms such as P-PageRank, PathSim, and SimRank have been proposed to find the top-k similar objects. However, SimRank has very high computational complexity, while PathSim measures similarity along only a single meta path. In this paper, we develop a novel algorithm, HighSim, which integrates PathSim with the basic methodology of LINE to improve similarity ranking by considering both the research topics and the publication venues of different authors. Specifically, we use PathSim on the meta path Author-Paper-Venue-Paper-Author (APVPA) to measure the similarity of the venues in which authors publish, and we use LINE to find authors with similar research topics through the papers they cite, i.e., their references. We then evaluate the algorithm on a bibliographic network dataset extracted from DBLP. The results show the effectiveness and flexibility of the proposed algorithm.
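
To make the venue-based component concrete, the following is a minimal sketch of PathSim on the Author-Paper-Venue-Paper-Author meta path. It uses the standard PathSim definition, s(x, y) = 2 * M[x][y] / (M[x][x] + M[y][y]), where M[x][y] counts the path instances between authors x and y. The toy authors, papers, and venues, and all function names, are hypothetical and chosen only for illustration; the LINE-based topic similarity and the way HighSim combines the two scores are not specified in the abstract and are not shown here.

```python
from collections import defaultdict

# Hypothetical toy data (not from the paper): who wrote what, and where it appeared.
AUTHOR_PAPERS = {
    "alice": ["p1", "p2", "p3"],
    "bob": ["p3", "p4"],
    "carol": ["p5"],
}
PAPER_VENUE = {"p1": "KDD", "p2": "KDD", "p3": "VLDB", "p4": "KDD", "p5": "WWW"}


def author_venue_counts(author_papers, paper_venue):
    """Count Author-Paper-Venue path instances: w[a][v] = papers by a in venue v."""
    counts = {}
    for author, papers in author_papers.items():
        row = defaultdict(int)
        for paper in papers:
            row[paper_venue[paper]] += 1
        counts[author] = dict(row)
    return counts


def path_count(w, x, y):
    """Number of Author-Paper-Venue-Paper-Author path instances between x and y."""
    return sum(c * w[y].get(venue, 0) for venue, c in w[x].items())


def pathsim(w, x, y):
    """PathSim score: 2 * M[x][y] / (M[x][x] + M[y][y]); 0.0 if both have no paths."""
    denom = path_count(w, x, x) + path_count(w, y, y)
    return 2 * path_count(w, x, y) / denom if denom else 0.0


def top_k(w, query, k):
    """Return the k authors most similar to `query`, ties broken by name."""
    scores = [(pathsim(w, query, a), a) for a in w if a != query]
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [(a, s) for s, a in scores[:k]]


W = author_venue_counts(AUTHOR_PAPERS, PAPER_VENUE)
print(top_k(W, "alice", 2))
```

In this toy network, "alice" and "bob" both publish in KDD and VLDB, so "bob" ranks above "carol", who shares no venue with "alice". The normalization by self-path counts is what keeps a highly prolific author from dominating every ranking.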
