Bottom-k and priority sampling, set similarity and subset sums with minimal independence

We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements of X that are smallest according to a given hash function h. With this sample we can estimate the relative size f = |Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples of A and B, we construct the bottom-k sample of their union as S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), and the similarity is then estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k.

We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic k×min-wise approach, which uses k independent hash functions h_1, ..., h_k and stores the smallest element under each of them. With k×min-wise and constant independence there is at least a constant bias, and the bias is not reduced by larger k. Recently, Feigenblat et al. showed that bottom-k sampling circumvents this bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound that transfers generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., yielding stronger error probability bounds with higher independence.

For weighted sets, we consider priority sampling, which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time the analysis is much more involved, but again we show that generic concentration bounds can be applied.

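To make the estimators above concrete, here is a minimal Python sketch of bottom-k sampling with a 2-independent hash function. It is an illustration, not the paper's implementation: it assumes integer keys, uses a Carter-Wegman style hash h(x) = ((a*x + b) mod p)/p over the Mersenne prime p = 2^61 - 1, and the names (make_2independent_hash, bottom_k, estimate_fraction, estimate_jaccard) are introduced here for the example.

    import random

    _P = (1 << 61) - 1  # Mersenne prime 2^61 - 1; any prime exceeding the key universe works

    def make_2independent_hash(seed=None):
        # Carter-Wegman style hash: h(x) = ((a*x + b) mod p) / p maps integer keys into [0, 1).
        # Drawing a != 0 makes x -> (a*x + b) mod p a permutation of the field, so distinct
        # keys below p never collide.
        rng = random.Random(seed)
        a = rng.randrange(1, _P)
        b = rng.randrange(0, _P)
        return lambda x: ((a * x + b) % _P) / _P

    def bottom_k(items, k, h):
        # Bottom-k sample S_k(X): the k elements of X with the smallest hash values.
        return set(sorted(items, key=h)[:k])

    def estimate_fraction(X, Y, k, h):
        # Estimate f = |Y|/|X| for a subset Y of X as |S_k(X) ∩ Y| / k.
        return len(bottom_k(X, k, h) & set(Y)) / k

    def estimate_jaccard(A, B, k, h):
        # Estimate |A ∩ B| / |A ∪ B| from the two bottom-k samples alone, using
        # S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)) and counting samples common to both sets.
        SA, SB = bottom_k(A, k, h), bottom_k(B, k, h)
        S_union = bottom_k(SA | SB, k, h)
        return len(S_union & SA & SB) / k

    # Example: the true Jaccard similarity of A and B is 4000/12000 ≈ 0.33.
    h = make_2independent_hash(seed=1)
    A, B = set(range(8000)), set(range(4000, 12000))
    print(estimate_jaccard(A, B, k=256, h=h))

Note that the Jaccard estimator only needs the two stored samples S_k(A) and S_k(B); the full sets are never revisited, which is what makes the sketch useful for large collections of sets.
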
[1] Mario Szegedy, et al. The DLT priority sampling is essentially optimal, 2006, STOC '06.

[2] Stefan Savage, et al. Inside the Slammer Worm, 2003, IEEE Secur. Priv.

[3] Carsten Lund, et al. Learn more, sample less: control of volume and variance in network measurement, 2005, IEEE Transactions on Information Theory.

[4] Mikkel Thorup, et al. Tabulation-Based 5-Independent Hashing with Applications to Linear Probing and Second Moment Estimation, 2012, SIAM J. Comput.

[5] Yossi Matias, et al. Polynomial Hash Functions Are Reliable (Extended Abstract), 1992, ICALP.

[6] Andrei Z. Broder, et al. Identifying and Filtering Near-Duplicate Documents, 2000, CPM.

[7] Alan M. Frieze, et al. Min-Wise Independent Permutations, 2000, J. Comput. Syst. Sci.

[8] Larry Carter, et al. New classes and applications of hash functions, 1979, 20th Annual Symposium on Foundations of Computer Science (SFCS 1979).

[9] Ely Porat, et al. Exponential Space Improvement for minwise Based Algorithms, 2012, FSTTCS.

[10] Gurmeet Singh Manku, et al. Detecting near-duplicates for web crawling, 2007, WWW '07.

[11] Anna Pagh, et al. Linear probing with constant independence, 2006, STOC '07.

[12] Helen J. Wang, et al. Online aggregation, 1997, SIGMOD '97.

[13] Ely Porat, et al. Sketching Techniques for Collaborative Filtering, 2009, IJCAI.

[14] Mikkel Thorup, et al. Twisted Tabulation Hashing, 2013, SODA.

[15] Martin Dietzfelbinger, et al. Universal Hashing and k-Wise Independent Random Variables via Integer Arithmetic without Primes, 1996, STACS.

[16] S. Muthukrishnan, et al. Estimating Rarity and Similarity over Data Stream Windows, 2002, ESA.

[17] Russ Bubley, et al. Randomized algorithms, 1995, CSUR.

[18] Carsten Lund, et al. Priority sampling for estimation of arbitrary subset sums, 2007, JACM.

[19] Carl-Erik Särndal, et al. Model Assisted Survey Sampling, 1997.

[20] James N. Rosenau, et al. To learn more, 2004, IEEE Potentials.

[21] Kai-Min Chung, et al. Why simple hash functions work: exploiting the entropy in a data stream, 2008, SODA '08.

[22] Geoffrey Zweig, et al. Syntactic Clustering of the Web, 1997, Comput. Networks.

[23] Ely Porat, et al. Fast Pseudo-Random Fingerprints, 2010, ArXiv.

[24] Ely Porat, et al. Even Better Framework for min-wise Based Algorithms, 2011, ArXiv.

[25] Mikkel Thorup, et al. Confidence intervals for priority sampling, 2006, SIGMETRICS '06/Performance '06.

[27] Edith Cohen, et al. Summarizing data using bottom-k sketches, 2007, PODC '07.

[28] Edith Cohen, et al. Finding interesting associations without support pruning, 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[29] Monika Henzinger, et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms, 2006, SIGIR.

[30] Daniel Shawcross Wilkerson, et al. Winnowing: local algorithms for document fingerprinting, 2003, SIGMOD '03.

[31] Ely Porat, et al. Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems, 2009, SPIRE.

[32] Mark A. McComb. A Practical Guide to Heavy Tails, 2000, Technometrics.

[33] Mikkel Thorup, et al. On the k-Independence Required by Linear Probing and Minwise Independence, 2010, TALG.

[34] Carsten Lund, et al. Charging from sampled network usage, 2001, IMW '01.

[35] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[36] Larry Rudolph, et al. A Complexity Theory of Efficient Parallel Algorithms, 1990, Theor. Comput. Sci.

[37] Grace Hui Yang, et al. Near-duplicate detection by instance-level constrained clustering, 2006, SIGIR.

[38] Aravind Srinivasan, et al. Chernoff-Hoeffding bounds for applications with limited independence, 1995, SODA '93.

[39] Carsten Lund, et al. Variance optimal sampling based estimation of subset sums, 2008, ArXiv.

[40] Luca Trevisan, et al. Counting Distinct Elements in a Data Stream, 2002, RANDOM.