A Simple and Efficient Estimation Method for Stream Expression Cardinalities

Estimating the cardinality (i.e., the number of distinct elements) of an arbitrary set expression defined over multiple distributed streams is one of the most fundamental queries of interest. Earlier methods based on probabilistic sketches have focused mostly on the sketching algorithms. However, their estimators do not fully utilize the information in the sketches and thus are not statistically efficient. In this paper, we develop a novel statistical model and an efficient yet simple estimator for the cardinalities based on a continuous variant of the well-known Flajolet-Martin sketches. Specifically, we show that, for two streams, our estimator has almost the same statistical efficiency as the Maximum Likelihood Estimator (MLE), which is known to be optimal in the sense of Cramér-Rao lower bounds under regularity conditions. Moreover, as the number of streams gets larger, our estimator remains computationally simple, whereas the MLE becomes intractable due to the complexity of the likelihood. Let N be the cardinality of the union of all streams, and |S| be the cardinality of a set expression S to be estimated. For a given relative standard error δ, the memory requirement of our estimator is O(δ^{-2} |S|^{-1} N log log N), which is superior to state-of-the-art algorithms, especially for large N and small |S|/N, where the estimation is most challenging.
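
To make the idea of a continuous Flajolet-Martin-style sketch concrete, the following is a minimal Python sketch under stated assumptions; it is not the estimator developed in the paper. It assumes each item is hashed to m exponential variates and each stream keeps the coordinate-wise minimum, so that a coordinate of a stream with n distinct items is Exp(n) distributed. The function names (`exp_hash`, `sketch`, `union`, `cardinality`, `intersection_estimate`), the SHA-256-based hash, and the default m = 256 are all hypothetical choices made for illustration. The intersection estimate shown is a simple proportion-of-matching-minima heuristic for two streams only.

```python
import hashlib
import math


def exp_hash(item, j):
    """Deterministically map (item, j) to an Exp(1) variate (illustrative hash)."""
    digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
    k = int.from_bytes(digest[:8], "big") >> 12   # keep 52 bits
    u = (k + 0.5) / 2**52                         # strictly inside (0, 1)
    return -math.log(u)


def sketch(stream, m=256):
    """Coordinate-wise minimum of exponential hashes; duplicates have no effect."""
    y = [math.inf] * m
    for item in stream:
        for j in range(m):
            h = exp_hash(item, j)
            if h < y[j]:
                y[j] = h
    return y


def union(a, b):
    """The sketch of a union of streams is the coordinate-wise minimum."""
    return [min(p, q) for p, q in zip(a, b)]


def cardinality(y):
    """Each y[j] is Exp(n), so sum(y) is Gamma(m, n) and (m - 1) / sum(y) is unbiased for n."""
    return (len(y) - 1) / sum(y)


def intersection_estimate(a, b):
    """Assumed heuristic: the two minima coincide with probability |A ∩ B| / |A ∪ B|."""
    match_fraction = sum(p == q for p, q in zip(a, b)) / len(a)
    return match_fraction * cardinality(union(a, b))


if __name__ == "__main__":
    sa = sketch(range(0, 1500))
    sb = sketch(range(500, 2000))
    print("union estimate:", cardinality(union(sa, sb)))
    print("intersection estimate:", intersection_estimate(sa, sb))
```

In this toy version the relative standard error of the union estimate shrinks roughly as 1/sqrt(m), while the intersection estimate degrades as |A ∩ B| / |A ∪ B| gets small, which matches the abstract's remark that small |S|/N is the hardest regime.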
