Topological Graph Sketching for Incremental and Scalable Analytics

We propose a novel, scalable, and principled graph sketching technique based on minwise hashing of local neighborhoods. For an n-node graph with e edges (e >> n), we incrementally maintain, in real time, a minwise neighbor-sampled subgraph using k hash functions in O(n × k) memory, with the memory limit user-configurable through the parameter k. Symmetrization and similarity-based techniques can recover a significant portion of the original graph from these data structures. We present a theoretical analysis of the minwise sampling strategy and derive unbiased estimators for important graph properties such as triangle count and neighborhood overlap. We perform an extensive empirical evaluation of our graph sketch and its derivatives on a wide variety of real-world graph data sets drawn from different application domains, using important large-network analysis algorithms: local and global clustering coefficient, PageRank, and local graph sparsification. With bounded memory, the quality of results using the sketch representation is competitive with baselines that use the full graph, and the computational performance is often better. Our framework is flexible and configurable, so it can be leveraged by numerous other graph analytics algorithms, potentially reducing the information mining time on large streamed graphs for a variety of applications.
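The incremental maintenance described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `MinwiseNeighborSketch`, the linear hash family, and the integer node ids are all assumptions made for illustration. For each node it keeps, under each of k hash functions, the neighbor with the smallest hash value, which gives O(n × k) memory. The `jaccard_estimate` method uses the standard MinHash property (matching minima estimate Jaccard similarity) and may differ from the estimators derived in the paper.

```python
import random


class MinwiseNeighborSketch:
    """Hypothetical sketch: per node, keep the min-hash neighbor under each of k hashes."""

    PRIME = (1 << 61) - 1  # modulus for the assumed hash family

    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Assumed hash family: h_i(x) = (a_i * x + b_i) mod PRIME, integer node ids.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(k)
        ]
        self.slots = {}  # node -> list of k (hash value, neighbor) pairs

    def _hashes(self, x):
        return [(a * x + b) % self.PRIME for a, b in self.params]

    def _update(self, u, v):
        # Offer v as a neighbor of u: it replaces slot i only if its hash is smaller.
        slots = self.slots.setdefault(u, [None] * self.k)
        for i, h in enumerate(self._hashes(v)):
            if slots[i] is None or h < slots[i][0]:
                slots[i] = (h, v)

    def add_edge(self, u, v):
        """Process one streamed undirected edge in O(k) time."""
        self._update(u, v)
        self._update(v, u)

    def sampled_neighbors(self, u):
        """Distinct neighbors retained for u (at most k of them)."""
        return {s[1] for s in self.slots.get(u, []) if s is not None}

    def jaccard_estimate(self, u, v):
        """Fraction of slots whose min-hash neighbor agrees (standard MinHash estimate)."""
        su, sv = self.slots.get(u), self.slots.get(v)
        if not su or not sv:
            return 0.0
        matches = sum(1 for a, b in zip(su, sv) if a[1] == b[1])
        return matches / self.k


# Usage: stream a few edges and query the sketch.
sketch = MinwiseNeighborSketch(k=4, seed=42)
for edge in [(1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (2, 12), (3, 20), (3, 21)]:
    sketch.add_edge(*edge)
print(sketch.sampled_neighbors(1))
print(sketch.jaccard_estimate(1, 2), sketch.jaccard_estimate(1, 3))
```

Larger k retains more of each neighborhood and tightens the overlap estimate, at a proportional cost in memory and per-edge update time.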
