Efficient Algorithm for Computing Link-Based Similarity in Real World Networks

Similarity calculation has many applications, including information retrieval and collaborative filtering. Link-based similarity measures, such as SimRank, have been shown to be very effective in characterizing object similarities in networks such as the Web by exploiting object-to-object relationships. Unfortunately, computing link-based similarity is prohibitively expensive on relatively large graphs. In this paper, based on the observation that link-based similarity scores of real-world graphs follow a power-law distribution, we propose a new approximate algorithm, Power-SimRank, with a guaranteed error bound, to compute link-based similarity measures efficiently. We also prove the convergence of the proposed algorithm. Extensive experiments on real-world and synthetic datasets show that the proposed algorithm outperforms SimRank by four to five times in terms of efficiency, while the error introduced by the approximation is small.
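To illustrate why the baseline is costly, the following is a minimal sketch of the standard iterative SimRank computation, not of Power-SimRank itself, whose details are not given here. The function name `simrank`, the `in_neighbors` dictionary layout, the decay factor of 0.8, and the iteration count are assumptions made for this example. Each iteration visits every node pair and, for each pair, every pair of in-neighbors, which is where the expense on large graphs comes from.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank (illustrative sketch).

    in_neighbors maps each node to the list of nodes that link to it.
    Returns a dict mapping (a, b) node pairs to similarity scores.
    """
    nodes = list(in_neighbors)
    # Initial scores: a node is fully similar to itself, 0 otherwise.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}

    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-links has no evidence of similarity.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(i, j)] for i, j in product(ia, ib))
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Toy graph (hypothetical): one page links to two others.
graph = {"univ": [], "profA": ["univ"], "profB": ["univ"]}
scores = simrank(graph)
print(scores[("profA", "profB")])
```

In this toy graph the two pages share their only in-neighbor, so their score settles at the decay factor after the first iteration. The nested loops over all node pairs and all in-neighbor pairs are the cost that an approximation such as the one proposed here aims to reduce.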
