Distinct Counting with a Self-Learning Bitmap

Estimating the number of distinct values is a fundamental problem in databases that has attracted extensive research over the past two decades because of its wide range of applications, especially on the Internet. Many sampling- and sketching-based algorithms have been proposed to obtain statistical estimates that require only limited computing and memory resources. However, their relative estimation accuracy usually depends on the unknown cardinality. In this paper, we address the following question: can a distinct counting algorithm have uniformly reliable performance, i.e., a constant relative estimation error for unknown cardinalities over a wide range, say from tens to millions? We propose a self-learning bitmap algorithm (S-bitmap) to answer this question. The S-bitmap is a bitmap obtained via a novel adaptive sampling process, in which the bits corresponding to the sampled items are set to 1, and the sampling rates are learned from the number of distinct items already seen and are reduced sequentially as more bits are set to 1. A unique property of the S-bitmap is that its relative estimation error is truly stabilized, i.e., invariant to unknown cardinalities in a prescribed range. We demonstrate through both theoretical and empirical studies that, for a given memory budget, the S-bitmap is not only uniformly reliable but also more accurate than state-of-the-art algorithms such as the multiresolution bitmap \cite{bitmap:2006} and HyperLogLog \cite{flajolet.et.al.07} under commonly used settings.
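The adaptive sampling idea described in the abstract can be illustrated with a small Python sketch. This is not the S-bitmap as specified in the paper: the abstract does not give the sampling-rate schedule or the estimator, so the geometric schedule `r ** (k - 1)`, the class name `AdaptiveBitmapSketch`, the parameters `m` and `r`, and the waiting-time estimator below are all assumptions chosen for illustration. The sketch only shows the mechanism: each item is hashed to a bucket and a uniform value, a bit is set only if the item passes the current sampling rate, and the rate shrinks as more bits are set.

```python
import hashlib


class AdaptiveBitmapSketch:
    """Toy adaptive-sampling bitmap. The rate schedule is an assumption,
    not the schedule derived in the paper."""

    def __init__(self, m=2048, r=0.998):
        self.m = m                # number of bits in the bitmap
        self.r = r                # hypothetical geometric decay of the sampling rate
        self.bits = bytearray(m)  # one byte per bit, for readability
        self.filled = 0           # number of bits currently set to 1

    def _rate(self, k):
        # Sampling rate in force while the k-th bit is waiting to be set (k >= 1).
        return self.r ** (k - 1)

    def add(self, item):
        # Derive both the bucket and the sampling draw from the item's hash,
        # so repeated occurrences of an item always make the same draw.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.m
        u = int.from_bytes(digest[8:16], "big") / 2**64
        if self.bits[bucket]:
            return  # bucket already occupied: nothing changes
        if u < self._rate(self.filled + 1):
            self.bits[bucket] = 1
            self.filled += 1

    def estimate(self):
        # With k-1 bits set, a new distinct item sets a bit with probability
        # q_k = (fraction of empty buckets) * (current sampling rate).
        # Summing the expected waiting times 1/q_k gives a simple estimate.
        total = 0.0
        for k in range(1, self.filled + 1):
            q_k = (1 - (k - 1) / self.m) * self._rate(k)
            total += 1.0 / q_k
        return total


if __name__ == "__main__":
    sketch = AdaptiveBitmapSketch()
    for i in range(20000):
        sketch.add(f"item-{i}")
    print(sketch.filled, round(sketch.estimate()))
```

Because the sampling rate never increases, an item rejected once is rejected on every later occurrence, and an item accepted once finds its bit already set. Duplicates therefore leave the bitmap unchanged, which is what makes the structure a distinct counter rather than a total counter.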

[1]  P. Bickel,et al.  Mathematical Statistics: Basic Ideas and Selected Topics , 1977 .

[2]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[3]  P. Haas,et al.  Estimating the Number of Classes in a Finite Population , 1998 .

[4]  Dimitrios Stiliadis,et al.  Nobot: Embedded malware detection for endpoint devices , 2011, Bell Labs Technical Journal.

[5]  Kyu-Young Whang,et al.  Approximating the number of unique values of an attribute without sorting , 1987, Inf. Syst..

[6]  Donald Ervin Knuth,et al.  The Art of Computer Programming, 2nd Ed. (Addison-Wesley Series in Computer Science and Information , 1978 .

[7]  Philippe Flajolet,et al.  Probabilistic Counting Algorithms for Data Base Applications , 1985, J. Comput. Syst. Sci..

[8]  Robert H. Morris,et al.  Counting large numbers of events in small registers , 1978, CACM.

[9]  George Varghese,et al.  Bitmap algorithms for counting active flows on high speed links , 2003, IMC '03.

[10]  Edith Cohen,et al.  Size-Estimation Framework with Applications to Transitive Closure and Reachability , 1997, J. Comput. Syst. Sci..

[11]  David P. Woodruff,et al.  An optimal algorithm for the distinct elements problem , 2010, PODS '10.

[12]  Frédéric Giroire,et al.  Order statistics and estimating cardinalities of massive data sets , 2009, Discret. Appl. Math..

[13]  Philippe Flajolet,et al.  Adaptive Sampling , 1997 .

[14]  Brian Rexroad,et al.  Wide-Scale Botnet Detection and Characterization , 2007, HotBots.

[15]  P. Flajolet,et al.  HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm , 2007 .

[16]  P. Flajolet,et al.  Loglog counting of large cardinalities , 2003 .

[17]  Sumit Ganguly,et al.  Counting distinct items over update streams , 2005, Theor. Comput. Sci..

[18]  R. Durrett Probability: Theory and Examples , 1993 .

[19]  Chang Xuan Mao,et al.  Inference on the Number of Species Through Geometric Lower Bounds , 2006 .

[20]  Donald Ervin Knuth,et al.  The Art of Computer Programming, Volume III, 2nd Edition , 1998 .

[21]  Peter J. Haas,et al.  On synopses for distinct-value estimation under multiset operations , 2007, SIGMOD '07.

[22]  Peter J. Haas,et al.  Distinct-value synopses for multiset operations , 2009, CACM.

[23]  Amr El Abbadi,et al.  Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic , 2008, EDBT '08.

[24]  David Thomas,et al.  The Art in Computer Programming , 2001 .

[25]  Larry Shepp,et al.  Distinct Counting With a Self-Learning Bitmap , 2011 .

[26]  J. Bunge,et al.  Estimating the Number of Species: A Review , 1993 .

[27]  Phillip B. Gibbons Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports , 2001, VLDB.

[28]  Walter A. Rosenkrantz,et al.  Approximate counting: a martingale approach , 1987 .

[29]  Tian Bu,et al.  Design and Evaluation of a Fast and Robust Worm Detection Algorithm , 2006, Proceedings IEEE INFOCOM 2006. 25TH IEEE International Conference on Computer Communications.

[30]  Luca Trevisan,et al.  Counting Distinct Elements in a Data Stream , 2002, RANDOM.

[31]  Kyu-Young Whang,et al.  A linear-time probabilistic counting algorithm for database applications , 1990, TODS.