Boosting distinct random sampling for basic counting on the union of distributed streams

We revisit the classic basic counting problem in the distributed streaming model. For maintaining an $(\epsilon, \delta)$-estimate, we make the following new contributions: (1) For a bit stream of size $n$, where each bit is 1 with probability at least $\gamma$, we exponentially reduce the average total processing time from the best prior work's $\Theta(n \log(1/\delta))$ to $O\left(\frac{1}{\gamma \epsilon^2} (\log^2 n) \log(1/\delta)\right)$, thus providing the first sublinear-time streaming algorithm for this problem. (2) In addition to much faster processing overall, our method offers a new tradeoff: a lower accuracy demand (a larger value of $\epsilon$) yields faster processing, whereas the best prior work's processing time is $\Theta(n \log(1/\delta))$ in every case and for any $\epsilon$. (3) The worst-case total time cost of our method matches the best prior work's $\Theta(n \log(1/\delta))$, which is necessary but rarely occurs in our method. (4) The space overhead of our method is a lower-order term compared with the best prior work's space usage, occurs only $O(\log n)$ times during stream processing, and is too small to be detected by the OS in practice. We further validate these theoretical results with experiments on both real-world and synthetic data, showing that our method is faster than the best prior work by a factor of several to several hundred, depending on the stream size and accuracy demands, without any detectable space overhead. Our method is based on a faster sampling technique that we design to boost the sampling procedure in the best prior work, and we believe this technique may be of independent interest.
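To give a rough feel for why a sampling step need not touch every stream item, here is a minimal sketch of generic skip-based Bernoulli sampling. It is not the algorithm of this paper, whose details the abstract does not give; the function names `skip_sample_indices` and `estimate_ones` and the fixed sampling rate `p` are assumptions made purely for illustration. Instead of flipping a coin per item, the sketch draws the gap to the next kept item from a geometric distribution, so the work is proportional to the number of kept items rather than to $n$.

```python
import math
import random


def skip_sample_indices(n, p, rng):
    """Return the indices in [0, n) kept as if each index were
    selected independently with probability p (illustrative helper)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if p == 1.0:
        return list(range(n))
    kept = []
    i = -1
    log_q = math.log(1.0 - p)
    while True:
        u = 1.0 - rng.random()           # uniform in (0, 1]
        gap = int(math.log(u) / log_q)   # items skipped before the next kept one
        i += gap + 1
        if i >= n:
            return kept
        kept.append(i)


def estimate_ones(bits, p, rng):
    """Estimate the number of 1-bits by scaling the count of 1s in the sample."""
    idx = skip_sample_indices(len(bits), p, rng)
    return sum(bits[i] for i in idx) / p


if __name__ == "__main__":
    rng = random.Random(2024)
    bits = [1 if rng.random() < 0.3 else 0 for _ in range(100_000)]
    print("true count:", sum(bits))
    print("estimate  :", estimate_ones(bits, 0.05, rng))
```

The sketch samples a single stream at one fixed rate. Handling the union of several distributed streams, and choosing the rate to meet a given $(\epsilon, \delta)$ guarantee, are outside its scope.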
