HeteRank: A general similarity measure in heterogeneous information networks by integrating multi-type relationships

Abstract: As heterogeneous information networks become ubiquitous and increasingly complex, many data mining tasks on them have been explored, including clustering, collaborative filtering and link prediction. Similarity computation is a fundamental task underlying many of these problems. Although a large number of similarity measures have been developed for heterogeneous networks, they usually depend on the network schema and lack a general way of integrating the different kinds of relationships between objects. In this paper, we propose a similarity measure, HeteRank, for computing similarities in heterogeneous information networks in a general manner. The relationships between objects of different types are represented by a general relationship matrix (GRM), which is built based on the scales of the different object types. Based on the GRM, HeteRank fully integrates the multi-type relationships into similarity computation by utilizing all the meetings between objects. The HeteRank equation is further transformed into a simple binomial expression form that takes a restart probability into account. To compute HeteRank similarities efficiently, we divide the computation into two steps: the first computes intermediate values, and the second computes the similarities from those intermediate values. We then approximate the HeteRank equation by setting thresholds that skip low intermediate values and similarity scores. A pruning algorithm is developed to reduce the unnecessary visits, multiplications and additions that contribute little to the similarity computation. Extensive experiments on real datasets demonstrate the effectiveness and efficiency of HeteRank in comparison with state-of-the-art similarity measures.
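The two-step computation with threshold-based skipping can be illustrated with a minimal sketch. The abstract does not state the HeteRank equation or how the GRM is built, so the code below substitutes a generic SimRank-style iteration over a single column-normalized relationship matrix. The function names, the `decay` factor, and the thresholds `inter_eps` and `sim_eps` are all hypothetical and are not taken from the paper.

```python
def column_normalize(adj):
    """Scale each column of a square matrix to sum to 1 (all-zero columns stay zero)."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def pruned_similarity(adj, decay=0.8, iterations=5, inter_eps=0.0, sim_eps=0.0):
    """SimRank-style iteration S <- decay * W^T S W, split into two pruned steps.

    Stand-in for illustration only, NOT the HeteRank equation.
    inter_eps and sim_eps are hypothetical pruning thresholds.
    """
    n = len(adj)
    w = column_normalize(adj)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        # Step 1: intermediate values T = S * W, visiting only nonzero entries.
        inter = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for k in range(n):
                s = sim[i][k]
                if s == 0.0:
                    continue
                for j in range(n):
                    if w[k][j] != 0.0:
                        inter[i][j] += s * w[k][j]
        # Drop intermediate values below the threshold so step 2 skips them.
        for i in range(n):
            for j in range(n):
                if inter[i][j] < inter_eps:
                    inter[i][j] = 0.0
        # Step 2: similarities S' = decay * W^T * T from the surviving values.
        new_sim = [[0.0] * n for _ in range(n)]
        for k in range(n):
            for i in range(n):
                if w[k][i] == 0.0:
                    continue
                for j in range(n):
                    if inter[k][j] != 0.0:
                        new_sim[i][j] += decay * w[k][i] * inter[k][j]
        # Each object is fully similar to itself; drop low off-diagonal scores.
        for i in range(n):
            for j in range(n):
                if i == j:
                    new_sim[i][j] = 1.0
                elif new_sim[i][j] < sim_eps:
                    new_sim[i][j] = 0.0
        sim = new_sim
    return sim


# Toy network: objects 0 and 1 (e.g. authors) are both linked to object 2 (e.g. a paper).
toy = [[0, 0, 1],
       [0, 0, 1],
       [1, 1, 0]]
scores = pruned_similarity(toy)
# Objects 0 and 1 share their only neighbor, so their score equals the decay factor.
```

Raising `sim_eps` above the decay factor (for example `sim_eps=0.9`) removes the score between objects 0 and 1 entirely, which shows the trade-off that any such threshold makes between skipped work and lost similarity mass.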
