SimRank computation on uncertain graphs

SimRank is a similarity measure between vertices in a graph, and it has become a fundamental technique in graph analytics. Recently, many algorithms have been proposed for efficient evaluation of SimRank similarities. However, the existing SimRank computation algorithms either overlook uncertainty in graph structures or are based on an unreasonable assumption (Du et al.). In this paper, we study SimRank similarities on uncertain graphs based on the possible worlds model of uncertain graphs. Following the random-walk-based formulation of SimRank on deterministic graphs and the possible worlds model of uncertain graphs, we define random walks on uncertain graphs for the first time and show that our definition of random walks satisfies the Markov property. We formulate the SimRank measure based on random walks on uncertain graphs. We discover a critical difference between random walks on uncertain graphs and random walks on deterministic graphs, which makes all existing SimRank computation algorithms for deterministic graphs inapplicable to uncertain graphs. To compute SimRank similarities efficiently, we propose three algorithms: the baseline algorithm, which has high accuracy; the sampling algorithm, which has high efficiency; and the two-phase algorithm, which has efficiency comparable to that of the sampling algorithm and a relative error about an order of magnitude smaller. Extensive experiments and case studies verify the effectiveness of our SimRank measure and the efficiency of our SimRank computation algorithms.
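To make the possible worlds model concrete, the following is a minimal Python sketch, not any of the three algorithms proposed in the paper. It assumes that each directed edge exists independently with a given probability, samples deterministic graphs (possible worlds) accordingly, runs the standard iterative SimRank of Jeh and Widom on each sample, and averages the scores. The function names (`sample_world`, `simrank`, `averaged_simrank`), the decay factor `c = 0.6`, and the iteration and sample counts are illustrative assumptions. The averaged value is only a naive reference point and need not coincide with the random-walk-based measure defined in the paper.

```python
import random
from itertools import product


def sample_world(edges, rng):
    """Draw one possible world: keep each edge (u, v, p) independently with probability p."""
    return [(u, v) for u, v, p in edges if rng.random() < p]


def simrank(nodes, edges, c=0.6, iters=10):
    """Standard iterative SimRank on a deterministic directed graph.

    s(a, a) = 1; for a != b, s(a, b) is c times the average similarity
    over all pairs of in-neighbors of a and b (0 if either has none).
    """
    in_nbrs = {v: [] for v in nodes}
    for u, v in edges:
        in_nbrs[v].append(u)

    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iters):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                new[(a, b)] = 0.0
            else:
                total = sum(sim[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                new[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim


def averaged_simrank(nodes, edges, u, v, samples=200, seed=0):
    """Average the deterministic SimRank score of (u, v) over sampled possible worlds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        world = sample_world(edges, rng)
        total += simrank(nodes, world)[(u, v)]
    return total / samples


if __name__ == "__main__":
    # Hypothetical toy graph: vertex "c" points to "a" and "b", each edge with probability 0.5.
    nodes = ["a", "b", "c"]
    edges = [("c", "a", 0.5), ("c", "b", 0.5)]
    print(averaged_simrank(nodes, edges, "a", "b"))
```

In this toy graph, "a" and "b" are similar only in worlds where both edges from "c" are present, so the averaged score lies between 0 and the decay factor `c`. Recomputing all-pairs SimRank for every sample is expensive; the sketch favors clarity over efficiency.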
