Second-order random walk-based proximity measures in graph analysis: formulations and algorithms

Measuring the proximity between nodes is a fundamental problem in graph analysis. Random walk-based proximity measures are widely used and have been shown to be effective. Most existing random walk measures are based on the first-order Markov model, i.e., they assume that the next step of the random surfer depends only on the current node. However, this assumption does not hold in many real-life applications, and it fails to capture the clustering structure of the graph. To address this limitation of first-order measures, this paper studies second-order random walk measures, which take the previously visited node into consideration. Whereas first-order measures are built on node-to-node transition probabilities, a second-order random walk requires edge-to-edge transition probabilities. Using incidence matrices, we develop simple and elegant matrix representations for the second-order proximity measures. A desirable property of the developed measures is that they degenerate to their original first-order forms when the effect of the previous step is zero. We further develop Monte Carlo methods to compute the second-order measures efficiently and provide theoretical performance guarantees. Experimental results show that, in a variety of applications, the second-order measures can dramatically improve performance over their first-order counterparts.
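To make the idea of a second-order walk concrete, the following is a minimal Python sketch of a Monte Carlo estimate of a random-walk-with-restart style proximity in which the next step depends on both the previous and the current node. It is an illustration under stated assumptions, not the formulation from the paper: the memory weight `alpha`, the triangle-closing bias used as the history-dependent term, the function names, and the toy graph are all hypothetical. The sketch does preserve the degeneration property described above: with `alpha = 0` the history term is never used and the walk is an ordinary first-order walk.

```python
import random
from collections import Counter

# Toy undirected graph (hypothetical): two triangles joined by the bridge 2-3.
ADJ = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}

def second_order_step(adj, prev, cur, alpha, rng):
    """Sample the next node given the previous and the current node.

    Assumption: with probability alpha the walker prefers neighbours of
    `cur` that are also adjacent to `prev` (a triangle-closing bias);
    otherwise it takes a plain first-order step.
    """
    nbrs = adj[cur]
    if prev is not None and rng.random() < alpha:
        closing = [w for w in nbrs if w in adj[prev]]
        if closing:
            return rng.choice(closing)
    # First-order fallback: uniform over the neighbours of the current node.
    return rng.choice(nbrs)

def mc_proximity(adj, source, alpha, restart=0.15, walks=20000, seed=0):
    """Estimate proximity to `source` from the endpoints of random walks.

    Each walk stops with probability `restart` before every step, so the
    endpoint frequencies approximate a restart-based proximity vector.
    """
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(walks):
        prev, cur = None, source
        while rng.random() >= restart:
            prev, cur = cur, second_order_step(adj, prev, cur, alpha, rng)
        ends[cur] += 1
    return {v: ends[v] / walks for v in adj}

first_order = mc_proximity(ADJ, source=0, alpha=0.0)
second_order = mc_proximity(ADJ, source=0, alpha=0.9)
# Probability mass that stays inside the source's own triangle {0, 1, 2}.
mass_first = sum(first_order[v] for v in (0, 1, 2))
mass_second = sum(second_order[v] for v in (0, 1, 2))
print(f"first-order mass in source cluster:  {mass_first:.3f}")
print(f"second-order mass in source cluster: {mass_second:.3f}")
```

In this toy setting the history-dependent term keeps the walker inside the source's triangle more often than the first-order walk does, which is one simple way a memory of the previous node can reflect clustering structure.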
