Distinct Counting With a Self-Learning Bitmap

Counting the number of distinct elements (the cardinality) of a dataset is a fundamental problem in database management. In recent years, there has been significant interest in addressing the distinct counting problem in the data stream setting, where each incoming data item can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature that require only limited computing and memory resources. However, the performance of these methods is not scale invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is undesirable in many applications where cardinalities can be dynamic or inhomogeneous and many cardinalities need to be estimated. In this article, we develop a novel approach, called the self-learning bitmap (S-bitmap), that is scale invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but also scale invariant. We demonstrate that, to achieve a small RRMSE value of ε or less, our approach requires significantly less memory and uses similar or fewer operations than state-of-the-art methods for many cardinality scales common in practice. Both simulation and experimental studies are reported.
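The update rule described above (a binary vector whose entries flip from 0 to 1 under a sampling rate that shrinks as the vector fills) can be illustrated with a minimal Python sketch. This is not the article's exact construction: the class name `AdaptiveBitmap`, the geometric rate schedule `r ** k`, the default parameters, and the use of BLAKE2b hashing are all assumptions chosen for illustration. The estimator relies only on the fact that, with k bits already set, each new distinct item flips a bit with probability q_k = p_k (m - k) / m, so the expected number of distinct items needed to set L bits is the sum of 1 / q_k for k < L.

```python
import hashlib


class AdaptiveBitmap:
    """Illustrative adaptive-sampling bitmap (hypothetical schedule, not the paper's)."""

    def __init__(self, m=4096, r=0.999):
        self.m = m                    # number of bits in the vector
        self.bits = bytearray(m)      # one byte per bit, for clarity rather than compactness
        self.filled = 0               # number of entries currently equal to 1
        # p[k] is the sampling rate used while k bits are set.
        # ASSUMPTION: a simple non-increasing geometric schedule.
        self.p = [r ** k for k in range(m)]

    def _hash(self, item):
        # Derive a bucket and a uniform value in [0, 1) from the item itself,
        # so that repeated items always map to the same (bucket, u) pair.
        d = hashlib.blake2b(repr(item).encode(), digest_size=16).digest()
        bucket = int.from_bytes(d[:8], "big") % self.m
        u = int.from_bytes(d[8:], "big") / 2**64
        return bucket, u

    def add(self, item):
        bucket, u = self._hash(item)
        if self.bits[bucket]:
            return                    # bucket already set: nothing to do
        if u < self.p[self.filled]:   # sample at the current (reduced) rate
            self.bits[bucket] = 1
            self.filled += 1

    def estimate(self):
        # Sum of expected waiting times (in distinct items) for each bit flip so far.
        total = 0.0
        for k in range(self.filled):
            q = self.p[k] * (self.m - k) / self.m
            total += 1.0 / q
        return total


if __name__ == "__main__":
    bm = AdaptiveBitmap()
    for i in range(10000):
        bm.add(i)
        bm.add(i)                     # duplicates do not change the state
    print(bm.filled, round(bm.estimate()))
```

Because the sampling rates never increase, an item that was rejected once is rejected on every later occurrence, and an item that was accepted finds its bucket already set. The state therefore depends only on the distinct items seen, which is what makes the bitmap usable for distinct counting in a single pass.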
