A Framework for Estimating Stream Expression Cardinalities

Given $m$ distributed data streams $A_1, \dots, A_m$, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over $A_1, \dots, A_m$. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate the generality of the framework, we describe several new sampling algorithms in this class and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
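To make the problem statement concrete, the following is a minimal sketch of one simple way to estimate the cardinality of a set expression over two streams: fixed-threshold hash sampling. It is an illustration under stated assumptions, not the algorithm class analyzed in the paper. The names `unit_hash`, `sample_stream`, and `estimate`, the sampling rate `p`, and the synthetic streams are all hypothetical. Each identifier is hashed to $[0, 1)$ and retained only if its hash falls below `p`; because every stream uses the same hash function, set operations on the samples give samples of the corresponding set expressions, and dividing the sample size by `p` gives an estimate.

```python
import hashlib


def unit_hash(identifier: str) -> float:
    """Map an identifier to a reproducible pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # 48 bits fit exactly in a double, so the result is strictly below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def sample_stream(stream, p: float) -> set:
    """Keep the distinct identifiers whose hash falls below the threshold p."""
    return {x for x in stream if unit_hash(x) < p}


def estimate(sample: set, p: float) -> float:
    """Scale the sample size by 1/p to estimate the underlying cardinality."""
    return len(sample) / p


# Hypothetical streams: 6000 identifiers each, overlapping in 2000 of them.
A1 = [f"user{i}" for i in range(0, 6000)]
A2 = [f"user{i}" for i in range(4000, 10000)]

p = 0.1  # assumed sampling rate; smaller p uses less space but adds variance
S1 = sample_stream(A1, p)
S2 = sample_stream(A2, p)

# Set expressions are evaluated on the samples, then scaled.
union_estimate = estimate(S1 | S2, p)         # true value: 10000
intersection_estimate = estimate(S1 & S2, p)  # true value: 2000
difference_estimate = estimate(S1 - S2, p)    # true value: 4000
```

With `p = 1.0` every identifier is retained and the estimates are exact; smaller values of `p` trade accuracy for space, which is the kind of tradeoff the abstract refers to.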
