Fast and Flexible Top-k Similarity Search on Large Networks

Similarity search is a fundamental problem in network analysis with many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method that uses random paths to efficiently estimate similarities based on both common neighbors and structural contexts in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sample size depends on the error bound ϵ, the confidence level (1-δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return the top-k similar vertices for any vertex in a network 300× faster than state-of-the-art methods. We also develop a prototype system that recommends similar authors to demonstrate the effectiveness of our method.
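The random-path idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that the similarity between two vertices is estimated as the fraction of sampled paths containing both, and it takes the number of paths as a plain parameter rather than deriving it from the error bound and confidence level. The function names `sample_random_paths` and `top_k_similar`, and the toy graph, are hypothetical.

```python
import random
from collections import defaultdict


def sample_random_paths(adj, num_paths, path_len, rng):
    """Sample num_paths random walks, each with at most path_len edges.

    adj maps every vertex to a list of its neighbors. The start vertex
    is drawn uniformly at random (an assumption of this sketch).
    """
    vertices = list(adj)
    paths = []
    for _ in range(num_paths):
        current = rng.choice(vertices)
        path = [current]
        for _ in range(path_len):
            neighbors = adj[current]
            if not neighbors:  # dead end: stop this walk early
                break
            current = rng.choice(neighbors)
            path.append(current)
        paths.append(path)
    return paths


def top_k_similar(paths, query, k):
    """Rank vertices by the fraction of sampled paths shared with query."""
    counts = defaultdict(int)
    for path in paths:
        members = set(path)  # count each vertex once per path
        if query in members:
            for v in members:
                if v != query:
                    counts[v] += 1
    total = len(paths)
    # Sort by descending count; break ties by vertex label for stability.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [(v, c / total) for v, c in ranked[:k]]


# Toy undirected graph: two triangles joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
rng = random.Random(0)  # fixed seed so runs are repeatable
paths = sample_random_paths(adj, num_paths=2000, path_len=3, rng=rng)
result = top_k_similar(paths, query=0, k=3)
print(result)
```

The sampling cost in this sketch is governed by the number of paths and the path length rather than by the number of edges, which is why an approach of this kind can scale to very large networks. Once the paths are sampled, a query only needs the paths that contain the query vertex.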
