Is min-wise hashing optimal for summarizing set intersection?

Min-wise hashing is an important method for estimating the size of the intersection of sets, based on a succinct summary (a "min-hash") of each set. One application is estimating the number of data points that satisfy the conjunction of m >= 2 simple predicates, where a min-hash is available for the set of points satisfying each predicate. This has applications in query optimization and in approximate computation of COUNT aggregates. In this paper we address the question: How many bits must be allocated to each summary in order to get an estimate with (1 +/- epsilon)-relative error? The state-of-the-art technique for minimizing the encoding size, for any desired estimation error, is b-bit min-wise hashing due to Li and König (Communications of the ACM, 2011). We give new lower and upper bounds: Using information complexity arguments, we show that b-bit min-wise hashing is space optimal for m = 2 predicates, in the sense that the estimator's variance is within a constant factor of the smallest possible among all summaries with the given space usage. But for conjunctions of m > 2 predicates we show that the performance of b-bit min-wise hashing (and more generally any method based on "k-permutation" min-hash) deteriorates as m grows. We describe a new summary that nearly matches our lower bound for m >= 2. It asymptotically outperforms all k-permutation schemes (by around a factor Omega(m/log m)), as well as methods based on subsampling (by a factor Omega(log n_max), where n_max is the maximum set size).
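
For readers unfamiliar with the baseline, the following is a minimal Python sketch of k-permutation min-hash and its b-bit variant for m = 2 sets. It is an illustration under stated assumptions, not the paper's construction or its new summary. The helper names (`_hash`, `minhash`, `b_bit`, `estimate_jaccard`, `estimate_intersection`) are hypothetical, a seeded BLAKE2b hash stands in for a random permutation, and the b-bit correction uses the simplified collision probability 2^-b, which is only an approximation that assumes the sets are small relative to the universe.

```python
import hashlib


def _hash(seed: int, x: int) -> int:
    """Hypothetical stand-in for the seed-th random permutation: a 64-bit hash."""
    data = f"{seed}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(items, k: int):
    """k-permutation summary: the minimum hash value under each of k hash functions."""
    items = list(items)
    return [min(_hash(i, x) for x in items) for i in range(k)]


def b_bit(summary, b: int):
    """Keep only the lowest b bits of each min-hash value."""
    mask = (1 << b) - 1
    return [h & mask for h in summary]


def estimate_jaccard(sa, sb, b=None) -> float:
    """Estimate the Jaccard similarity from two summaries of equal length.

    With full min-hashes the match fraction is itself the estimate. With b-bit
    values, unequal min-hashes still agree with probability about 2^-b
    (simplifying assumption), so the match fraction is corrected for that.
    """
    match = sum(x == y for x, y in zip(sa, sb)) / len(sa)
    if b is None:
        return match
    c = 2.0 ** -b
    return max(0.0, (match - c) / (1.0 - c))


def estimate_intersection(j: float, size_a: int, size_b: int) -> float:
    """Convert Jaccard similarity J into |A ∩ B| using J = I / (|A| + |B| - I)."""
    return j * (size_a + size_b) / (1.0 + j)
```

As a usage example, one could summarize `range(0, 1000)` and `range(500, 1500)` with k = 512, pass the two summaries to `estimate_jaccard`, and feed the result together with the set sizes to `estimate_intersection`. The true Jaccard similarity here is 1/3 and the true intersection size is 500. Applying `b_bit(summary, 2)` to both summaries and calling `estimate_jaccard(..., b=2)` shows the space/variance trade-off that the abstract analyzes.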

[1]  Philip Bille,et al.  Fast Evaluation of Union-Intersection Expressions , 2007, ISAAC.

[2]  Desh Ranjan,et al.  Balls and bins: A study in negative dependence , 1996, Random Struct. Algorithms.

[3]  David P. Woodruff,et al.  An optimal algorithm for the distinct elements problem , 2010, PODS '10.

[4]  Ke Yi,et al.  Beyond simple aggregates: indexing for summary queries , 2011, PODS.

[5]  Alan M. Frieze,et al.  Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci.

[6]  David P. Woodruff,et al.  Tight bounds for distributed functional monitoring , 2011, STOC '12.

[7]  Edith Cohen,et al.  What You Can Do with Coordinated Samples , 2012, APPROX-RANDOM.

[8]  Ping Li,et al.  Theory and applications of b-bit minwise hashing , 2011, Commun. ACM.

[9]  T. S. Jayram.  Information complexity: a tutorial , 2010, PODS '10.

[10]  Silvio Lattanzi,et al.  On compressing social networks , 2009, KDD.

[11]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[12]  Ely Porat,et al.  Fast set intersection and two-patterns matching , 2009, Theor. Comput. Sci..

[13]  Amit Chakrabarti,et al.  An Optimal Lower Bound on the Communication Complexity of Gap-Hamming-Distance , 2012, SIAM J. Comput.

[14]  Ping Li,et al.  One Permutation Hashing for Efficient Search and Learning , 2012, ArXiv.

[16]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Compression and Complexity of SEQUENCES 1997.

[17]  Alan Siegel,et al.  On Universal Classes of Extremely Random Constant-Time Hash Functions , 1995, SIAM J. Comput.

[18]  David P. Woodruff.  Optimal space lower bounds for all frequency moments , 2004, SODA '04.

[19]  C. Papadimitriou,et al.  The complexity of massive data set computations , 2002.

[20]  P. Flajolet,et al.  HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm , 2007.

[21]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[22]  Noam Nisan,et al.  On Randomized One-round Communication Complexity , 1995, STOC '95.

[23]  Aravind Srinivasan,et al.  Chernoff-Hoeffding bounds for applications with limited independence , 1995, SODA '93.

[24]  Marianne Winslett,et al.  Multi-resolution bitmap indexes for scientific data , 2007, TODS.

[25]  Ping Li,et al.  b-Bit Minwise Hashing for Estimating Three-Way Similarities , 2010, NIPS.

[26]  David P. Woodruff,et al.  Tight lower bounds for the distinct elements problem , 2003, 44th Annual IEEE Symposium on Foundations of Computer Science.

[27]  Rasmus Pagh.  Low Redundancy in Static Dictionaries with Constant Query Time , 2001, SIAM J. Comput.

[28]  Florin Rusu,et al.  Sketches for size of join estimation , 2008, TODS.

[29]  Mikkel Thorup,et al.  Bottom-k and priority sampling, set similarity and subset sums with minimal independence , 2013, STOC '13.

[30]  Edith Cohen,et al.  Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments , 2009, Proc. VLDB Endow..

[31]  Mark Braverman,et al.  Information Lower Bounds via Self-Reducibility , 2015, Theory of Computing Systems.

[32]  Ping Li,et al.  b-Bit minwise hashing , 2009, WWW '10.