PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs

SimRank is a classic measure of the similarity between nodes in a graph. Given a node $u$ in a graph $G = (V, E)$, a single-source SimRank query returns the SimRank similarity $s(u, v)$ between $u$ and every node $v \in V$. Queries of this type have numerous applications in web search and social network analysis, such as link prediction, web mining, and spam detection. Existing methods for single-source SimRank queries, however, incur a query cost at least linear in the number of nodes $n$, which renders them inapplicable for real-time and interactive analysis. This paper proposes PRSim, an algorithm that exploits the structure of graphs to answer single-source SimRank queries efficiently. PRSim uses an index of size $O(m)$, where $m$ is the number of edges in the graph, and guarantees a query time that depends on the reverse PageRank distribution of the input graph. In particular, we prove that PRSim runs in sublinear time if the degree distribution of the input graph follows a power law, a property possessed by many real-world graphs. Based on the theoretical analysis, we show that the empirical query time of all existing SimRank algorithms also depends on the reverse PageRank distribution of the graph. Finally, we present the first experimental study that evaluates the absolute errors of various SimRank algorithms on large graphs, and we show that PRSim outperforms the state of the art in terms of query time, accuracy, index size, and scalability.
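To make the notion of a single-source query concrete, the following is a minimal sketch of the textbook iterative SimRank recurrence, in which $s(u, u) = 1$ and, for $u \neq v$, $s(u, v)$ is the decay factor times the average similarity over all pairs of in-neighbors of $u$ and $v$. It is not PRSim: it computes all pairs and then reads off one row, which is the kind of at-least-linear baseline cost the abstract refers to. The function names, the decay factor `c = 0.6`, the iteration count, and the toy graph are assumptions chosen for illustration.

```python
from itertools import product


def simrank_all_pairs(in_neighbors, c=0.6, iterations=10):
    """Naive iterative SimRank over every node pair (illustrative baseline only)."""
    nodes = list(in_neighbors)
    # Start from the identity: each node is fully similar to itself only.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # By definition, similarity is 0 when either node has no in-neighbors.
                new_sim[(a, b)] = 0.0
                continue
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new_sim[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


def single_source_simrank(in_neighbors, u, c=0.6, iterations=10):
    """Return s(u, v) for every node v by reading one row of the all-pairs result."""
    sim = simrank_all_pairs(in_neighbors, c, iterations)
    return {v: sim[(u, v)] for v in in_neighbors}


# Hypothetical toy graph with edges a->b, a->c, b->d, c->d,
# stored as a mapping from each node to its list of in-neighbors.
toy_in_neighbors = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
scores = single_source_simrank(toy_in_neighbors, "b")
```

In this toy graph, nodes `b` and `c` share their only in-neighbor `a`, so their similarity equals the decay factor, while `b` has zero similarity with `a` and `d` because `a` has no in-neighbors. The all-pairs table in this sketch grows quadratically with the number of nodes, which is why it is only suitable for very small graphs.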
