An Experimental Evaluation of SimRank-based Similarity Search Algorithms

Given a graph, SimRank is one of the most popular measures of the similarity between two vertices. Efficient SimRank computation has been studied intensively over the last decade, and researchers have proposed many algorithms that compute or approximate it. Despite these efforts, no systematic comparison of these algorithms exists. In this paper, we conduct an experimental study that compares them to understand their pros and cons. We first introduce a taxonomy of SimRank algorithms and classify each algorithm into one of three classes: iterative, non-iterative, and random walk-based methods. We implement ten algorithms published from 2002 to 2015 and compare them on synthetic and real-world graphs. To ensure fairness, our implementations share the same data structures and execution framework, and we optimize each algorithm as far as we can. Our study reveals that no algorithm dominates the others: algorithms based on iterative methods often have higher accuracy, while algorithms based on random walks can be more scalable. One non-iterative algorithm is both effective and efficient on medium-sized graphs. Thus, the best choice of algorithm depends on the requirements of the application. This paper provides an empirical guideline for making such choices.
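
To make the iterative class concrete, the following is a minimal sketch of the naive all-pairs SimRank iteration, in which two distinct vertices are similar to the extent that their in-neighbors are similar. It is not taken from the paper's implementations: the function name `simrank`, the adjacency format `in_neighbors`, the decay factor `c = 0.6`, and the iteration count are all assumptions chosen for illustration.

```python
from itertools import product


def simrank(in_neighbors, c=0.6, iterations=10):
    """Naive all-pairs SimRank by fixed-point iteration (illustrative sketch).

    in_neighbors: dict mapping every vertex to a list of its in-neighbors.
    c: decay factor in (0, 1); 0.6 is an assumed, illustrative value.
    """
    nodes = list(in_neighbors)
    # Initial scores: a vertex is fully similar to itself, otherwise 0.
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, repeat=2)}

    for _ in range(iterations):
        updated = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                updated[(u, v)] = 1.0
                continue
            in_u, in_v = in_neighbors[u], in_neighbors[v]
            if not in_u or not in_v:
                # A vertex without in-neighbors is similar only to itself.
                updated[(u, v)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, scaled by c.
            total = sum(sim[(a, b)] for a in in_u for b in in_v)
            updated[(u, v)] = c * total / (len(in_u) * len(in_v))
        sim = updated
    return sim


# Hypothetical toy graph with edges a -> b and a -> c.
graph = {"a": [], "b": ["a"], "c": ["a"]}
scores = simrank(graph)
```

On this toy graph, `b` and `c` share their only in-neighbor `a`, so `scores[("b", "c")]` equals the decay factor. Each iteration visits every vertex pair and every pair of their in-neighbors, which hints at why exact iterative computation scales poorly and why approximate, random walk-based alternatives are attractive on large graphs.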
