HyperLogLog: Exponentially Bad in Adversarial Settings

Counting the distinct elements in a large data set is a common task, but naive approaches are memory-expensive. The HyperLogLog (HLL) algorithm (Flajolet et al., 2007) estimates a data set’s cardinality using significantly less memory than a naive approach, at the cost of some accuracy. This trade-off makes HLL attractive for a wide range of applications, such as database management and network monitoring, where an exact count may not be needed. HLL and its variants are implemented in systems such as Redis and Google BigQuery. Recently, HLL has been proposed for scenarios where the inputs may be adversarially generated, for example counting social network users or detecting network scanning attacks. This prompts an examination of how HLL performs in the face of adversarial inputs. We show that in such a setting, HLL’s cardinality estimate can be exponentially bad: when an adversary has access to the internals of the HLL algorithm and has some flexibility in choosing which inputs are recorded, it can manipulate the cardinality estimate to be exponentially smaller than the true cardinality. We study both the original HLL algorithm and a more modern version of it (Ertl, 2017) that is used in Redis. We present experimental results confirming our theoretical analysis. Finally, we consider attack prevention: we show how to modify HLL in a simple way that provably prevents attacks that manipulate the cardinality estimate.
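To make the threat model concrete, the following is a minimal sketch, not the attack analysed in the paper. It assumes a toy HLL with 16 registers, the original raw estimator with the small-range (linear counting) correction, and an unkeyed SHA-256 hash that the adversary can evaluate offline. All names (`hash64`, `index_and_rank`, `choose_adversarial_items`, the `user-`/`honest-` item labels) are hypothetical and chosen for illustration. The adversary keeps only candidate items that land in register 0 with rank 1, so many distinct items touch a single register.

```python
import hashlib
import math

P = 4                      # index bits (assumed for this demo)
M = 1 << P                 # number of registers: 16
HASH_BITS = 64             # bits of the hash that are used
WIDTH = HASH_BITS - P      # bits left over for the rank computation


def hash64(item: str) -> int:
    """Public, unkeyed 64-bit hash (an assumption of this sketch)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def index_and_rank(item: str) -> tuple[int, int]:
    """Return (register index, position of the leftmost 1-bit in the rest)."""
    h = hash64(item)
    idx = h >> WIDTH
    rest = h & ((1 << WIDTH) - 1)
    rank = WIDTH - rest.bit_length() + 1   # WIDTH + 1 when rest == 0
    return idx, rank


def alpha(m: int) -> float:
    """Bias-correction constant from the original HLL description."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)


class HyperLogLog:
    def __init__(self) -> None:
        self.registers = [0] * M

    def add(self, item: str) -> None:
        idx, rank = index_and_rank(item)
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        raw = alpha(M) * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * M and zeros > 0:
            return M * math.log(M / zeros)   # small-range correction
        return raw                           # large-range correction omitted


def choose_adversarial_items(count: int) -> list[str]:
    """Filter candidates offline: keep those hitting register 0 with rank 1."""
    chosen, i = [], 0
    while len(chosen) < count:
        candidate = f"user-{i}"
        if index_and_rank(candidate) == (0, 1):
            chosen.append(candidate)
        i += 1
    return chosen


honest = HyperLogLog()
for i in range(200):
    honest.add(f"honest-{i}")

attack_items = choose_adversarial_items(200)
attacked = HyperLogLog()
for item in attack_items:
    attacked.add(item)

honest_estimate = honest.estimate()
attack_estimate = attacked.estimate()
print("true cardinality in both runs:", 200)
print("estimate on honest input:     ", round(honest_estimate, 2))
print("estimate on adversarial input:", round(attack_estimate, 2))
```

In this sketch the adversarial sketch ends with one register at 1 and fifteen at 0, so the small-range correction returns 16·ln(16/15) ≈ 1.03 no matter how many distinct items were inserted. The filter works only because the adversary can compute the hash itself, which matches the abstract's assumption of access to the algorithm's internals.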

[1] Jean-Philippe Aumasson, et al. SipHash: A Fast Short-Input PRF, 2012, INDOCRYPT.

[2] Alexander Hall, et al. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm, 2013, EDBT '13.

[3] Lajos Rónyai, et al. Factoring polynomials over finite fields, 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[4] Dan S. Wallach, et al. Denial of Service via Algorithmic Complexity Attacks, 2003, USENIX Security Symposium.

[5] Kyu-Young Whang, et al. A linear-time probabilistic counting algorithm for database applications, 1990, TODS.

[6] Haim Kaplan, et al. Adversarially Robust Streaming Algorithms via Differential Privacy, 2020, NeurIPS.

[7] Raja Chiky, et al. How can sliding HyperLogLog and EWMA detect port scan attacks in IP traffic?, 2014, EURASIP J. Inf. Secur.

[8] David P. Woodruff, et al. A Framework for Adversarially Robust Streaming Algorithms, 2020, SIGMOD Rec.

[9] P. Flajolet, et al. Loglog counting of large cardinalities, 2003.

[10] Jacob Nelson, et al. Evaluating the Power of Flexible Packet Processing for Network Resource Allocation, 2017, NSDI.

[11] P. Flajolet, et al. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, 2007.

[12] Chao Li, et al. Improved Cryptanalysis on SipHash, 2019, CANS.

[13] Russ Bubley, et al. Randomized algorithms, 1995, CSUR.

[14] Georges Hébrail, et al. Sliding HyperLogLog: Estimating Cardinality in a Data Stream over a Sliding Window, 2010, 2010 IEEE International Conference on Data Mining Workshops.

[15] Pedro Reviriego, et al. Security of HyperLogLog (HLL) Cardinality Estimation: Vulnerabilities and Protection, 2020, IEEE Communications Letters.

[16] Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators, 2020, 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS).

[17] Haim Kaplan, et al. Separating Adaptive Streaming from Oblivious Streaming Using the Bounded Storage Model, 2021, CRYPTO.

[18] Cédric Lauradoux, et al. The Power of Evil Choices in Bloom Filters, 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[19] Marcos Dias de Assunção, et al. Apache Spark, 2019, Encyclopedia of Big Data Technologies.

[20] Otmar Ertl, et al. New Cardinality Estimation Methods for HyperLogLog Sketches, 2017, ArXiv.

[21] Christopher Patton, et al. Probabilistic Data Structures in Adversarial Environments, 2019, CCS.

[22] Alexander Hall, et al. Processing a Trillion Cells per Mouse Click, 2012, Proc. VLDB Endow.

[23] Jeffrey F. Naughton, et al. Clocked adversaries for hashing, 1993, Algorithmica.

[24] Sujata Garera, et al. Challenges in teaching a graduate course in applied cryptography, 2009, SGCS.

[25] Florian Mendel, et al. Differential Cryptanalysis of SipHash, 2014, Selected Areas in Cryptography.

[26] Moni Naor, et al. Bloom Filters in Adversarial Environments, 2015, CRYPTO.

[27] David A. Basin, et al. Cardinality Estimators do not Preserve Privacy, 2018, Proc. Priv. Enhancing Technol.