Paper Information - Estimating Cardinality for Arbitrarily Large Data Stream With Improved Memory Efficiency

Cardinality estimation is the task of determining the number of distinct elements (the cardinality) in a data stream, under the stringent constraint that the stream can be scanned in only a single pass. This is a fundamental problem with many practical applications, such as traffic monitoring in high-speed networks and query optimization in Internet-scale databases. To solve the problem, we propose an algorithm named HLL-TailCut, which achieves an estimation standard error of $1.0 / \sqrt{m}$ with $m$ memory units of only four or three bits each, smaller than the five-bit memory units used by HyperLogLog, the best previously known cardinality estimator. This reduces the memory cost of HyperLogLog by 20% to 45%. For example, when the target estimation error is 1.1%, state-of-the-art HyperLogLog needs 5.6 kilobytes of memory, whereas our algorithm needs only 3 kilobytes to attain the same accuracy. Additionally, our algorithm supports the estimation of very large stream cardinalities, even at the tera and peta scale.
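The memory figures quoted in the abstract can be roughly reproduced with a short calculation. The sketch below is illustrative only and rests on assumptions not stated in the abstract: HyperLogLog's standard error is taken as $1.04 / \sqrt{m}$ (the constant from the original HyperLogLog analysis), the $1.0 / \sqrt{m}$ error is paired with three-bit units, and a kilobyte is taken as 1000 bytes. The helper names `registers_needed` and `memory_bytes` are hypothetical.

```python
import math


def registers_needed(target_error: float, constant: float) -> int:
    """Smallest m such that constant / sqrt(m) <= target_error."""
    return math.ceil((constant / target_error) ** 2)


def memory_bytes(target_error: float, constant: float, bits_per_unit: int) -> float:
    """Total memory in bytes for m units of bits_per_unit bits each."""
    m = registers_needed(target_error, constant)
    return m * bits_per_unit / 8


TARGET = 0.011  # 1.1% target standard error, as in the abstract

# Assumption: HyperLogLog with error 1.04 / sqrt(m) and five-bit units.
hll_bytes = memory_bytes(TARGET, constant=1.04, bits_per_unit=5)

# Assumption: the 1.0 / sqrt(m) error applies to three-bit units.
tailcut_bytes = memory_bytes(TARGET, constant=1.0, bits_per_unit=3)

saving = 1 - tailcut_bytes / hll_bytes

print(f"HyperLogLog: {hll_bytes / 1000:.1f} kB")
print(f"HLL-TailCut (3-bit): {tailcut_bytes / 1000:.1f} kB")
print(f"Memory saving: {saving:.0%}")
```

Under these assumptions the results land close to the abstract's 5.6 kilobytes, 3 kilobytes, and the upper end of the 20% to 45% range. The four-bit case is left out because the abstract does not say which error constant applies to it separately.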
