Graph sample and hold: a framework for big-graph analytics

Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g. web graphs, social networks), where an underlying network connects the units of the population. A good sample is therefore one that is representative in the sense that the graph properties of interest can be estimated with a known degree of accuracy. While previous work has focused on sampling schemes designed to estimate particular graph properties (e.g. triangle count), much less is known about the case in which various graph properties must be estimated with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH), which samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state in memory. We use a Horvitz-Thompson construction in conjunction with a scheme that samples arriving edges without adjacencies to previously sampled edges with probability p and holds edges with adjacencies with probability q. Our sample and hold framework facilitates the accurate estimation of subgraph patterns by allowing the sampling probabilities to depend on the history of previously sampled edges. Within our framework, we show how to produce statistically unbiased estimators for various graph properties from the sample. Because the graph analytics run on a sample instead of the whole population, the runtime complexity is kept under control. Moreover, because the estimators are unbiased, the approximation error is also kept under control. Finally, we test the performance of the proposed framework (gSH) on various types of graphs, showing that from a sample with ≤ 40K edges, it produces estimates with relative errors < 1%.
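
The sampling rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-based state, and the restriction to estimating the edge count are assumptions made for illustration, and the stream is assumed to contain each edge at most once. Each sampled edge is stored together with the probability (p or q) with which it was selected, and the Horvitz-Thompson estimate of the number of edges is the sum of the inverse selection probabilities over the sampled edges.

```python
import random


def graph_sample_and_hold(edge_stream, p, q, seed=None):
    """Single-pass edge sampling in the spirit of gSH (illustrative only).

    An arriving edge that shares no endpoint with a previously sampled
    edge is kept with probability p; an edge that does share an endpoint
    is kept with probability q. Returns a dict mapping each sampled edge
    to the probability with which it was selected.
    """
    rng = random.Random(seed)
    sampled = {}      # (u, v) -> selection probability used for that edge
    touched = set()   # endpoints of all edges sampled so far
    for u, v in edge_stream:
        adjacent = u in touched or v in touched
        prob = q if adjacent else p
        # rng.random() lies in [0, 1), so prob=1 always keeps, prob=0 never keeps.
        if rng.random() < prob:
            sampled[(u, v)] = prob
            touched.add(u)
            touched.add(v)
    return sampled


def estimate_edge_count(sampled):
    """Horvitz-Thompson estimate of the total number of edges in the stream."""
    return sum(1.0 / prob for prob in sampled.values())


if __name__ == "__main__":
    # Hypothetical toy stream: a triangle plus one disconnected edge.
    stream = [(1, 2), (2, 3), (3, 1), (4, 5)]
    sample = graph_sample_and_hold(stream, p=0.5, q=0.9, seed=7)
    print(len(sample), estimate_edge_count(sample))
```

Storing the selection probability alongside each edge is what makes the inverse-probability weighting possible after the pass, since the probability used for an edge depends on which edges happened to be sampled before it.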
