New cardinality estimation algorithms for HyperLogLog sketches

This paper presents new methods for estimating the cardinalities of data sets recorded by HyperLogLog sketches. A theoretically motivated extension to the original estimator is presented that eliminates the bias for small and large cardinalities. Based on the maximum likelihood principle, a second unbiased method is derived, together with a robust and efficient numerical algorithm for calculating the estimate. The maximum likelihood approach can also be applied to more than a single HyperLogLog sketch. In particular, it is shown to give more precise cardinality estimates for the union, intersection, or relative complements of two sets that are both represented by HyperLogLog sketches than the conventional technique based on the inclusion-exclusion principle. All the new methods are demonstrated and verified by extensive simulations.
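For orientation, the conventional baseline mentioned above can be sketched as follows. This is a minimal illustration, not the estimators proposed in the paper: it builds a HyperLogLog sketch, applies the original raw estimator without any small-range or large-range correction, and estimates an intersection by inclusion-exclusion. The function names (`make_sketch`, `raw_estimate`, `merge`, `intersection_estimate`), the use of SHA-256 as the hash function, and the precision `p = 12` are assumptions made for this example.

```python
import hashlib

def make_sketch(items, p=12):
    """Build a HyperLogLog sketch with m = 2**p registers (illustrative only)."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        # Assumption: the first 64 bits of SHA-256 serve as the hash value.
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                    # first p bits select the register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based);
        # an all-zero remainder yields 64 - p + 1.
        rank = (64 - p) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs

def raw_estimate(regs):
    """Original raw HyperLogLog estimate, with no range corrections."""
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)           # standard constant, valid for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in regs)

def merge(regs_a, regs_b):
    """The sketch of a union is the register-wise maximum of the two sketches."""
    return [max(a, b) for a, b in zip(regs_a, regs_b)]

def intersection_estimate(regs_a, regs_b):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    union = raw_estimate(merge(regs_a, regs_b))
    return raw_estimate(regs_a) + raw_estimate(regs_b) - union

sketch_a = make_sketch(range(0, 20000))
sketch_b = make_sketch(range(10000, 30000))
print(raw_estimate(sketch_a))                      # true cardinality: 20000
print(raw_estimate(merge(sketch_a, sketch_b)))     # true union size: 30000
print(intersection_estimate(sketch_a, sketch_b))   # true intersection size: 10000
```

Because the inclusion-exclusion estimate is a difference of three noisy estimates, its absolute error does not shrink as the intersection gets smaller, so the relative error can become large for small intersections.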
