Order statistics and estimating cardinalities of massive data sets

This paper introduces a new class of algorithms that estimate the cardinality of very large multisets using constant memory and a single pass over the data. The algorithms are based on order statistics rather than on bit patterns in the binary representations of numbers. Three families of estimators are analyzed. They attain a standard error of 1/√M using M units of storage, which places them in the same class as the best known algorithms so far. The algorithms have a very simple inner loop, which gives them an advantage in processing speed. For instance, 12 kB of memory and only a few seconds are sufficient to process a multiset with several million elements and to build an estimate with an accuracy of order 2 percent. The algorithms are validated both by mathematical analysis and by experiments on real internet traffic.
