ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Single-source and top-$k$ SimRank queries are two important types of similarity search in graphs, with numerous applications in web mining, social network analysis, and spam detection. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads. This paper presents ProbeSim, an index-free algorithm for single-source and top-$k$ SimRank queries that provides a non-trivial theoretical guarantee on the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any index structure, and thus naturally supports real-time SimRank queries on dynamic graphs. Beyond the theoretical guarantee, ProbeSim also achieves good practical efficiency and effectiveness thanks to several non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions significantly outperform existing methods in terms of both efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on graphs with billions of edges, using the idea of pooling.
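
To make the query types concrete, the following is a minimal sketch of index-free single-source and top-$k$ SimRank estimation by plain Monte Carlo sampling. It is not the ProbeSim algorithm and has none of its optimizations or error guarantees; it only relies on the standard interpretation of SimRank as the probability that two $\sqrt{c}$-walks meet. The function names, the `in_nbrs` adjacency format, the decay factor `c = 0.6`, and the sample count are all assumptions made for illustration.

```python
import random

def sqrt_c_walk(in_nbrs, start, sqrt_c, rng, max_len=50):
    """Sample one sqrt(c)-walk: at each step, stop with probability
    1 - sqrt(c); otherwise move to a uniformly random in-neighbor."""
    path = [start]
    node = start
    while len(path) <= max_len:
        nbrs = in_nbrs.get(node, [])
        if not nbrs or rng.random() >= sqrt_c:
            break
        node = rng.choice(nbrs)
        path.append(node)
    return path

def single_source_simrank(in_nbrs, u, c=0.6, num_walks=2000, seed=0):
    """Estimate s(u, v) for every node v as the fraction of sampled
    walk pairs (one from u, one from v) that meet at the same step.
    Nothing is precomputed, so edge updates need no maintenance."""
    rng = random.Random(seed)
    sqrt_c = c ** 0.5
    nodes = set(in_nbrs) | {w for ns in in_nbrs.values() for w in ns}
    scores = {}
    for v in sorted(nodes):  # sorted for reproducibility (assumes comparable ids)
        if v == u:
            scores[v] = 1.0
            continue
        meets = 0
        for _ in range(num_walks):
            wu = sqrt_c_walk(in_nbrs, u, sqrt_c, rng)
            wv = sqrt_c_walk(in_nbrs, v, sqrt_c, rng)
            # The walks meet if they occupy the same node at the same step.
            if any(a == b for a, b in zip(wu[1:], wv[1:])):
                meets += 1
        scores[v] = meets / num_walks
    return scores

def top_k_simrank(in_nbrs, u, k, **kwargs):
    """Return the k nodes (excluding u) with the highest estimated scores."""
    scores = single_source_simrank(in_nbrs, u, **kwargs)
    ranked = sorted((v for v in scores if v != u),
                    key=lambda v: scores[v], reverse=True)
    return ranked[:k]

# Toy graph: nodes "b" and "c" share the single in-neighbor "a".
graph = {"a": [], "b": ["a"], "c": ["a"]}
scores = single_source_simrank(graph, "b")
best = top_k_simrank(graph, "b", k=1)
```

On this toy graph the exact value is $s(b, c) = c \cdot s(a, a) = 0.6$, so the estimate for `"c"` should land close to 0.6. The cost of this naive approach grows with the number of nodes times the number of samples, which is the kind of overhead a scalable method has to avoid.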
