Estimating Cardinality for Arbitrarily Large Data Stream With Improved Memory Efficiency

Cardinality estimation is the task of determining the number of distinct elements (the cardinality) in a data stream, under the stringent constraint that the stream can be scanned in only a single pass. This is a fundamental problem with many practical applications, such as traffic monitoring in high-speed networks and query optimization in Internet-scale databases. To solve the problem, we propose an algorithm named HLL-TailCut, which achieves an estimation standard error of $1.0 / \sqrt{m}$, where $m$ is the number of memory units, using memory units of only four or three bits each. This is considerably cheaper than the five-bit memory units used by HyperLogLog, the best previously known cardinality estimator, and makes it possible to reduce the memory cost of HyperLogLog by 20% to 45%. For example, when the target estimation error is 1.1%, state-of-the-art HyperLogLog needs 5.6 kilobytes of memory, whereas our new algorithm needs only 3 kilobytes to attain the same accuracy. Additionally, our algorithm supports the estimation of very large stream cardinalities, even on the tera and peta scale.
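The memory figures quoted above can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only and rests on assumptions not stated in the abstract: it uses the standard error $1.04 / \sqrt{m}$ commonly quoted for HyperLogLog, takes one kilobyte as 1000 bytes, ignores any auxiliary state either algorithm may keep beyond its $m$ memory units, and does not round $m$ to a power of two. The function names are hypothetical.

```python
import math


def registers_needed(target_error, constant):
    """Smallest m such that constant / sqrt(m) <= target_error."""
    return math.ceil((constant / target_error) ** 2)


def memory_kilobytes(target_error, constant, bits_per_register):
    """Memory for m registers of the given width, in kilobytes (1 KB = 1000 bytes, assumed)."""
    m = registers_needed(target_error, constant)
    return m * bits_per_register / 8 / 1000


# Assumption: HyperLogLog standard error of 1.04 / sqrt(m) with 5-bit registers.
hll_kb = memory_kilobytes(0.011, 1.04, 5)

# From the abstract: HLL-TailCut standard error of 1.0 / sqrt(m), here with 3-bit units.
tailcut_kb = memory_kilobytes(0.011, 1.0, 3)

saving = 1 - tailcut_kb / hll_kb

print(f"HyperLogLog: {hll_kb:.2f} KB")
print(f"HLL-TailCut: {tailcut_kb:.2f} KB")
print(f"Saving:      {saving:.1%}")
```

Under these assumptions the results come out close to the 5.6 kilobytes, 3 kilobytes, and roughly 45% saving stated above. With four-bit units the saving is smaller, which is consistent with the lower end of the quoted range.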
