No Repetition: Fast Streaming with Highly Concentrated Hashing

To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small-space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch-based estimator where the probability of an error that is too large is, say, 1/4. If we perform $r$ independent repetitions and take the median of the estimators, the error probability falls exponentially in $r$. However, running $r$ independent experiments increases the processing time by a factor of $r$. Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high-probability bounds without any need for repetitions. Instead of $r$ independent sketches, we have a single sketch that is $r$ times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor of $r$ in time, and the overall algorithms become simpler. Fast practical hash functions with strong concentration bounds were recently proposed by Aamand et al. (to appear in STOC 2020). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high-volume data streams.
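The contrast between the two strategies can be illustrated with a minimal Python sketch for distinct-element estimation. Everything here is an assumption made for illustration: the bottom-$k$ estimator $(k-1)/v_k$ (with $v_k$ the $k$-th smallest hash value) is one standard choice and not necessarily the estimator analyzed in the paper, the names `hash01`, `bottom_k_estimate`, `median_of_repetitions`, and `single_large_sketch` are hypothetical, and salted BLAKE2b is only a stand-in for a well-behaved hash function, not the scheme of Aamand et al.

```python
import hashlib
import heapq
from statistics import median


def hash01(x, seed):
    """Map x to a float in [0, 1). BLAKE2b is a stand-in hash (assumption)."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def bottom_k_estimate(stream, k, seed):
    """Estimate the number of distinct elements from the k smallest hash values."""
    heap = []     # max-heap via negation: holds the k smallest hash values seen
    kept = set()  # the same values, for duplicate detection
    for x in stream:
        h = hash01(x, seed)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept value with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct elements: exact count
    return (k - 1) / (-heap[0])  # -heap[0] is the k-th smallest hash value


def median_of_repetitions(stream, k, r):
    """r independent size-k sketches: r hash evaluations per stream element."""
    return median(bottom_k_estimate(stream, k, seed) for seed in range(r))


def single_large_sketch(stream, k, r):
    """One sketch of size r*k: the same space, one hash evaluation per element."""
    return bottom_k_estimate(stream, r * k, seed=0)


if __name__ == "__main__":
    data = [i % 5000 for i in range(20000)]  # 5000 distinct elements
    print("median of 8 sketches of size 64:", median_of_repetitions(data, 64, 8))
    print("one sketch of size 512:", single_large_sketch(data, 64, 8))
```

For simplicity the repeated variant re-reads a list once per seed; in a true one-pass setting all $r$ sketches would be updated for every element, which is where the factor of $r$ in processing time comes from. Whether the single large sketch actually attains the high-probability guarantee depends on the concentration properties of the hash function, which this stand-in does not establish.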

[1]  Mikkel Thorup,et al.  Fast hashing with strong concentration bounds , 2019, STOC.

[2]  Balachander Krishnamurthy,et al.  Sketch-based change detection: methods, evaluation, and applications , 2003, IMC '03.

[3]  David P. Woodruff,et al.  An optimal algorithm for the distinct elements problem , 2010, PODS '10.

[4]  Philippe Flajolet,et al.  Loglog Counting of Large Cardinalities (Extended Abstract) , 2003, ESA.

[5]  Jaroslaw Blasiok,et al.  Optimal Streaming and Tracking Distinct Elements with High Probability , 2018, SODA.

[6]  Mikkel Thorup,et al.  Bottom-k and priority sampling, set similarity and subset sums with minimal independence , 2013, STOC '13.

[7]  Philippe Flajolet,et al.  Probabilistic Counting Algorithms for Data Base Applications , 1985, J. Comput. Syst. Sci.

[8]  Phillip B. Gibbons.  Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports , 2001, VLDB.

[9]  Ziv Bar-Yossef,et al.  Reductions in streaming algorithms, with an application to counting triangles in graphs , 2002, SODA '02.

[10]  Peter J. Haas,et al.  On synopses for distinct-value estimation under multiset operations , 2007, SIGMOD '07.

[11]  Mikkel Thorup,et al.  Tabulation-Based 5-Independent Hashing with Applications to Linear Probing and Second Moment Estimation , 2012, SIAM J. Comput.

[12]  Luca Trevisan,et al.  Counting Distinct Elements in a Data Stream , 2002, RANDOM.

[13]  David P. Woodruff,et al.  Tight lower bounds for the distinct elements problem , 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings..

[14]  Srikanta Tirthapura,et al.  Estimating simple functions on the union of data streams , 2001, SPAA '01.

[15]  Mikkel Thorup,et al.  Twisted Tabulation Hashing , 2013, SODA.

[16]  P. Flajolet,et al.  HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm , 2007.

[17]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[18]  Joshua Brody,et al.  A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences , 2009, 2009 24th Annual IEEE Conference on Computational Complexity.

[19]  Russ Bubley,et al.  Randomized algorithms , 1995, CSUR.

[20]  David P. Woodruff.  Optimal space lower bounds for all frequency moments , 2004, SODA '04.

[21]  Larry Carter,et al.  New classes and applications of hash functions , 1979, 20th Annual Symposium on Foundations of Computer Science (sfcs 1979).

[22]  Mikkel Thorup,et al.  Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence , 2013, 2013 IEEE 54th Annual Symposium on Foundations of Computer Science.

[23]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[24]  Martin Dietzfelbinger,et al.  Universal Hashing and k-Wise Independent Random Variables via Integer Arithmetic without Primes , 1996, STACS.

[25]  George Varghese,et al.  Bitmap algorithms for counting active flows on high speed links , 2003, IMC '03.

[26]  Edith Cohen,et al.  Size-Estimation Framework with Applications to Transitive Closure and Reachability , 1997, J. Comput. Syst. Sci.