Paper information - Scalable similarity search for SimRank

Scalable similarity search for SimRank

SimRank, proposed by Jeh and Widom, provides a good similarity score and has been used successfully in many of the above-mentioned applications. Many algorithms have been proposed to compute SimRank, but unfortunately none of them scales to graphs with billions of edges. Motivated by this fact, we consider the following SimRank-based similarity search problem: given a query vertex u, find the top-k vertices v, that is, the k vertices with the highest SimRank scores s(u,v) with respect to u. We propose a very fast and scalable algorithm for this similarity search problem. Our method consists of the following ingredients: (1) We first introduce a "linear" recursive formula for SimRank. This lets us formulate the problem in a way that admits a very fast algorithm. (2) We establish a Monte-Carlo based algorithm to compute a single-pair SimRank score s(u,v), based on the random-walk interpretation of our linear recursive formula. (3) We show empirically that the SimRank score s(u,v) decreases rapidly as the distance d(u,v) increases. Therefore, to compute SimRank scores for a query vertex u in our similarity search problem, we only need to look at a very "local" area around u. (4) We combine two upper bounds on the SimRank score s(u,v) (which can be obtained by Monte-Carlo simulation during preprocessing) with an adaptive sampling technique to prune the similarity search procedure. This results in a much faster algorithm. Once preprocessing is done (which takes only O(n) time), our algorithm finds, for a given query vertex u, the top-20 vertices v with the highest SimRank scores s(u,v) in less than a few seconds, even for graphs with billions of edges. To the best of our knowledge, this is the first algorithm to scale to graphs with billions of edges (for the single-source case).
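To make the Monte-Carlo idea in ingredient (2) concrete, here is a minimal Python sketch. It does not implement the paper's linear recursive formula, which the abstract does not spell out. Instead it uses the classic random-surfer interpretation of SimRank from Jeh and Widom, in which s(u,v) is the expected value of c^τ and τ is the first time two independent walks along in-links, started at u and v, meet. The names `simrank_mc` and `top_k`, the adjacency format `in_nbrs`, and the parameter defaults (`c=0.6`, `walks=2000`, `max_steps=10`) are assumptions made for illustration. The top-k search is a brute-force scan with none of the distance-based or upper-bound pruning the abstract describes.

```python
import random


def simrank_mc(in_nbrs, u, v, c=0.6, walks=2000, max_steps=10, rng=None):
    """Estimate s(u, v) as the sample mean of c**tau over pairs of walks.

    in_nbrs maps each vertex to a list of its in-neighbors (hypothetical format).
    Walks are truncated at max_steps, so the estimate is biased slightly low.
    """
    if u == v:
        return 1.0  # s(u, u) = 1 by definition
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for t in range(1, max_steps + 1):
            na, nb = in_nbrs.get(a, ()), in_nbrs.get(b, ())
            if not na or not nb:
                break  # one walk is stuck, so this pair never meets
            # Each walk steps to a uniformly random in-neighbor.
            a, b = rng.choice(na), rng.choice(nb)
            if a == b:
                total += c ** t  # first meeting at step t contributes c**t
                break
    return total / walks


def top_k(in_nbrs, u, k, **kwargs):
    """Brute-force top-k: score every other vertex and keep the k best."""
    scores = {v: simrank_mc(in_nbrs, u, v, **kwargs) for v in in_nbrs if v != u}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```

As a sanity check, if u and v share a single in-neighbor w, every walk pair meets at step 1, so `simrank_mc({"u": ["w"], "v": ["w"], "w": []}, "u", "v")` returns c up to floating-point rounding. This matches the SimRank recursion, which gives s(u,v) = c · s(w,w) = c.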