Scalable similarity search for SimRank

SimRank, proposed by Jeh and Widom, provides a good similarity score and has been used successfully in many applications. Although many algorithms have been proposed to compute SimRank, none of them scales to graphs with billions of edges. Motivated by this fact, we consider the following SimRank-based similarity search problem: given a query vertex u, find the k vertices v with the highest SimRank scores s(u,v) with respect to u. We propose a very fast and scalable algorithm for this similarity search problem. Our method consists of the following ingredients: (1) We first introduce a "linear" recursive formula for SimRank, which lets us formulate the problem in a way that admits a very fast algorithm. (2) We establish a Monte Carlo algorithm to compute a single-pair SimRank score s(u,v), based on the random-walk interpretation of our linear recursive formula. (3) We show empirically that the SimRank score s(u,v) decreases rapidly as the distance d(u,v) increases. Therefore, to compute SimRank scores for a query vertex u, we only need to look at a very "local" area around u. (4) We combine two upper bounds on the SimRank score s(u,v) (which can be obtained by Monte Carlo simulation during preprocessing) with an adaptive sampling technique to prune the similarity search. This results in a much faster algorithm. Once preprocessing is done (which takes only O(n) time), our algorithm finds, for a given query vertex u, the top-20 vertices v with the highest SimRank scores s(u,v) within a few seconds, even for graphs with billions of edges. To the best of our knowledge, this is the first SimRank algorithm to scale to graphs with billions of edges (for the single-source case).
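To make ingredients (2) and (3) concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes the classical random-walk interpretation of SimRank (s(u,v) is the expected value of c^τ, where τ is the first step at which two walks that follow in-edges from u and v meet) rather than the linear recursive formula, and it omits the upper-bound pruning and adaptive sampling of ingredient (4). All function names, the decay factor c = 0.6, the number of walks, and the search radius are illustrative assumptions.

```python
import random
from collections import deque


def build_in_neighbors(edges):
    """Map each vertex to the list of its in-neighbors."""
    in_nbrs = {}
    for src, dst in edges:
        in_nbrs.setdefault(dst, []).append(src)
        in_nbrs.setdefault(src, [])
    return in_nbrs


def simrank_pair(in_nbrs, u, v, c=0.6, walks=1000, max_steps=10, rng=None):
    """Monte Carlo estimate of s(u, v) as the sample mean of c**tau, where tau
    is the first step at which two reverse random walks meet (assumed,
    classical estimator; walks are truncated after max_steps steps)."""
    if u == v:
        return 1.0
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for step in range(1, max_steps + 1):
            if not in_nbrs.get(a) or not in_nbrs.get(b):
                break  # a walk has no in-neighbor: this pair never meets
            a = rng.choice(in_nbrs[a])
            b = rng.choice(in_nbrs[b])
            if a == b:
                total += c ** step
                break
    return total / walks


def local_candidates(edges, u, radius=2):
    """Vertices within `radius` hops of u, ignoring edge direction (BFS)."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == radius:
            continue
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    dist.pop(u)
    return list(dist)


def top_k(edges, u, k=20, radius=2, **kwargs):
    """Rank only the local candidates of u by their estimated SimRank score."""
    in_nbrs = build_in_neighbors(edges)
    scored = [(simrank_pair(in_nbrs, u, v, **kwargs), v)
              for v in local_candidates(edges, u, radius)]
    scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
    return [(v, score) for score, v in scored[:k]]
```

For example, `top_k([("a", "b"), ("a", "c")], "b", k=1)` ranks `c` first, because `b` and `c` share their only in-neighbor `a`. Restricting the candidates to a small radius reflects observation (3); the radius needed in practice depends on the graph.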
