Exponential Space Improvement for minwise Based Algorithms

In this paper we introduce a general framework that exponentially improves the space, the degree of independence, and the time needed by min-wise based algorithms. The authors, in SODA 2011, we introduced an exponential time improvement for min-wise based algorithms by defining and constructing an almost k-min-wise independent family of hash functions. Here we develop an alternative approach that achieves both exponential time and exponential space improvement. The new approach relaxes the need for approximately min-wise hash functions, hence gets around the Omega(log(1/epsilon)) independence lower bound in [Patrascu 2010]. This is done by defining and constructing a d-k-min-wise independent family of hash functions. Surprisingly, for most cases only 8-wise independence is needed for the additional improvement. Moreover, as the degree of independence is a small constant, our function can be implemented efficiently. Informally, under this definition, all subsets of size d of any fixed set X have an equal probability to have hash values among the minimal k values in X, where the probability is over the random choice of hash function from the family. This property measures the randomness of the family, as choosing a truly random function, obviously, satisfies the definition for d=k=|X|. We define and give an efficient time and space construction of approximately d-k-min-wise independent family of hash functions for the case where d=2, as this is sufficient for the additional exponential improvement. We discuss how this construction can be used to improve many min-wise based algorithms. To our knowledge such definitions, for hash functions, were never studied and no construction was given before. As an example we show how to apply it for similarity and rarity estimation over data streams. Other min-wise based algorithms, can be adjusted in the same way.

[1]  Dan Klein,et al.  Evaluating strategies for similarity search on the web , 2002, WWW '02.

[2]  Andrei Z. Broder,et al.  Identifying and Filtering Near-Duplicate Documents , 2000, CPM.

[3]  Edith Cohen,et al.  Tighter estimation using bottom k sketches , 2008, Proc. VLDB Endow..

[4]  Alan M. Frieze,et al.  Min-wise independent permutations (extended abstract) , 1998, STOC '98.

[5]  Srikanta Tirthapura,et al.  Estimating simple functions on the union of data streams , 2001, SPAA '01.

[6]  Edith Cohen,et al.  Finding interesting associations without support pruning , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[7]  Ely Porat,et al.  Exponential time improvement for min-wise based algorithms , 2011, SODA '11.

[8]  Monika Henzinger,et al.  Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.

[9]  Ely Porat,et al.  Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems , 2009, SPIRE.

[10]  Ketan Mulmuley Randomized geometric algorithms and pseudo-random generators , 1992, Proceedings., 33rd Annual Symposium on Foundations of Computer Science.

[11]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[12]  Grace Hui Yang,et al.  Near-duplicate detection by instance-level constrained clustering , 2006, SIGIR.

[13]  Graham Cormode,et al.  What's new: finding significant differences in network data streams , 2004, INFOCOM 2004.

[14]  Edith Cohen,et al.  Summarizing data using bottom-k sketches , 2007, PODC '07.

[15]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[16]  Aravind Srinivasan,et al.  Low Discrepancy Sets Yield Approximate Min-Wise Independent Permutation Families , 1999, RANDOM-APPROX.

[17]  Piotr Indyk,et al.  A small approximately min-wise independent family of hash functions , 1999, SODA '99.

[18]  Ely Porat,et al.  Sketching Techniques for Collaborative Filtering , 2009, IJCAI.

[19]  Abhinandan Das,et al.  Google news personalization: scalable online collaborative filtering , 2007, WWW '07.

[20]  Mikkel Thorup,et al.  The power of simple tabulation hashing , 2010, STOC.

[21]  S. Muthukrishnan,et al.  Estimating Rarity and Similarity over Data Stream Windows , 2002, ESA.

[22]  Rajeev Rastogi,et al.  Processing set expressions over continuous update streams , 2003, SIGMOD '03.

[23]  Gurmeet Singh Manku,et al.  Detecting near-duplicates for web crawling , 2007, WWW '07.

[24]  Edith Cohen,et al.  Size-Estimation Framework with Applications to Transitive Closure and Reachability , 1997, J. Comput. Syst. Sci..