On link-based similarity join

Graphs arise in applications such as social networks, bibliographic networks, and biological databases. Understanding the relationships, or links, among graph nodes enables applications such as link prediction, recommendation, and spam detection. In this paper, we propose the link-based similarity join (LS-join), which extends the similarity join operator to link-based measures. Given two sets of nodes in a graph, the LS-join returns all pairs of nodes that are highly similar to each other with respect to an e-function. The e-function generalizes common measures such as Personalized PageRank (PPR) and SimRank (SR). We study an efficient LS-join algorithm for large graphs. We further improve our solutions for PPR and SR, both of which involve expensive random-walk operations. We validate our solutions with extensive experiments on three real graph datasets.
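
To make the operator concrete, the following is a minimal sketch of a naive LS-join that uses PPR as the similarity measure. It is not the algorithm proposed in the paper. It assumes that "highly similar" means a score at or above a user-chosen `threshold`, that PPR is approximated by plain power iteration with restart probability `alpha`, and that the graph is a dictionary mapping every node to its list of out-neighbors. The function names `personalized_pagerank` and `naive_ls_join` are hypothetical and chosen only for illustration.

```python
def personalized_pagerank(graph, source, alpha=0.15, iterations=50):
    """Approximate PPR scores of all nodes with respect to `source`.

    graph: dict mapping each node to a list of out-neighbors
           (assumption: every neighbor also appears as a key).
    alpha: restart probability (assumed value, not taken from the paper).
    """
    scores = {v: 0.0 for v in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in graph}
        for u, out in graph.items():
            walk_mass = (1 - alpha) * scores[u]
            if not out:
                # Dangling node: send its walk mass back to the source.
                nxt[source] += walk_mass
                continue
            share = walk_mass / len(out)
            for v in out:
                nxt[v] += share
        # With probability alpha the walk restarts at the source.
        nxt[source] += alpha
        scores = nxt
    return scores


def naive_ls_join(graph, left, right, threshold, alpha=0.15):
    """Return (a, b, score) for a in `left`, b in `right` with
    PPR score of b with respect to a at or above `threshold`."""
    result = []
    for a in left:
        scores = personalized_pagerank(graph, a, alpha)
        for b in right:
            if a != b and scores[b] >= threshold:
                result.append((a, b, scores[b]))
    # Most similar pairs first.
    result.sort(key=lambda pair: pair[2], reverse=True)
    return result


if __name__ == "__main__":
    toy_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    for a, b, score in naive_ls_join(toy_graph, ["a"], ["b", "c"], 0.1):
        print(a, b, round(score, 3))
```

This baseline runs one full PPR computation per node in the first set, which illustrates why the random-walk operations behind PPR and SR become expensive on a large graph.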
