Stream Aggregation Through Order Sampling

This paper introduces a new single-pass reservoir weighted-sampling stream aggregation algorithm, Priority-Based Aggregation (PBA). Order sampling is a powerful and efficient method for weighted sampling from a stream of uniquely keyed items, but no existing algorithm realizes its benefits for stream aggregation over non-unique keys. The naive approach, order sampling items regardless of key and then aggregating the results, is hopelessly inefficient. In contrast, our proposed algorithm uses a single persistent random variable across the lifetime of each key in the cache, and maintains unbiased estimates of the key aggregates that can be queried at any point in the stream. The basic approach can be supplemented with a Sample and Hold pre-sampling stage whose sampling rate adaptation is controlled by PBA. This approach considerably reduces computational complexity compared with the state of the art in adapting Sample and Hold to operate with a fixed cache size. Concerning statistical properties, we prove that PBA provides unbiased estimates of the true aggregates. We analyze the computational complexity of PBA and its variants, and provide a detailed evaluation of its accuracy on synthetic and trace data. Relative to Adaptive Sample and Hold, weighted relative error is reduced by 40% to 65% at sampling rates of 5% to 17%; there is also substantial improvement for rank queries.
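To make the starting point concrete, the following is a minimal sketch of classical priority sampling, one form of order sampling, over a stream of uniquely keyed items. It is not PBA itself and does not handle non-unique keys; the function name `priority_sample` and its parameters are illustrative assumptions, not taken from the paper. Each item of weight w draws a uniform u in (0, 1] and receives priority w/u; the k items of highest priority are retained, and the (k+1)-th highest priority serves as a threshold for the unbiased estimate max(w, threshold).

```python
import heapq
import random


def priority_sample(stream, k, rng=None):
    """Classical priority sampling over uniquely keyed (key, weight) items.

    Illustrative sketch only: assumes unique keys and positive weights.
    Returns a dict mapping each sampled key to its weight estimate.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (priority, key, weight); holds at most k + 1 items
    for key, weight in stream:
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
        heapq.heappush(heap, (weight / u, key, weight))
        if len(heap) > k + 1:
            heapq.heappop(heap)  # discard the lowest-priority item
    if len(heap) <= k:
        # The whole stream fits in the sample, so weights are exact.
        return {key: weight for _, key, weight in heap}
    # The (k+1)-th highest priority is the threshold; its item is not sampled.
    threshold, _, _ = heapq.heappop(heap)
    return {key: max(weight, threshold) for _, key, weight in heap}
```

Summing the returned estimates over any subset of keys gives an unbiased estimate of that subset's total weight. Applying this per item to a stream with repeated keys is the naive approach the abstract describes as inefficient, since every occurrence of a key would compete for its own cache slot.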
