Even Better Framework for min-wise Based Algorithms

In a recent paper from SODA11 \cite{kminwise} the authors introduced a general framework for exponential time improvement of \minwise based algorithms by defining and constructing almost \kmin independent family of hash functions. Here we take it a step forward and reduce the space and the independent needed for representing the functions, by defining and constructing a \dkmin independent family of hash functions. Surprisingly, for most cases only 8-wise independent is needed for exponential time and space improvement. Moreover, we bypass the $O(\log{\frac{1}{\epsilon}})$ independent lower bound for approximately \minwise functions \cite{patrascu10kwise-lb}, as we use alternative definition. In addition, as the independent's degree is a small constant it can be implemented efficiently. Informally, under this definition, all subsets of size $d$ of any fixed set $X$ have an equal probability to have hash values among the minimal $k$ values in $X$, where the probability is over the random choice of hash function from the family. This property measures the randomness of the family, as choosing a truly random function, obviously, satisfies the definition for $d=k=|X|$. We define and give an efficient time and space construction of approximately \dkmin independent family of hash functions. The degree of independent required is optimal, i.e. only $O(d)$ for $2 \le d < k=O(\frac{d}{\epsilon^2})$, where $\epsilon \in (0,1)$ is the desired error bound. This construction can be used to improve many \minwise based algorithms, such as \cite{sizeEstimationFramework,Datar02estimatingrarity,NearDuplicate,SimilaritySearch,DBLP:conf/podc/CohenK07}, as will be discussed here. To our knowledge such definitions, for hash functions, were never studied and no construction was given before.

[1]  Gurmeet Singh Manku,et al.  Detecting near-duplicates for web crawling , 2007, WWW '07.

[2]  Grace Hui Yang,et al.  Near-duplicate detection by instance-level constrained clustering , 2006, SIGIR.

[3]  Alan M. Frieze,et al.  Min-wise independent permutations (extended abstract) , 1998, STOC '98.

[4]  Monika Henzinger,et al.  Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.

[5]  Ely Porat,et al.  Sketching Techniques for Collaborative Filtering , 2009, IJCAI.

[6]  Aravind Srinivasan,et al.  Low Discrepancy Sets Yield Approximate Min-Wise Independent Permutation Families , 1999, RANDOM-APPROX.

[7]  S. Muthukrishnan,et al.  Estimating Rarity and Similarity over Data Stream Windows , 2002, ESA.

[8]  Edith Cohen,et al.  Size-Estimation Framework with Applications to Transitive Closure and Reachability , 1997, J. Comput. Syst. Sci..

[9]  Srikanta Tirthapura,et al.  Estimating simple functions on the union of data streams , 2001, SPAA '01.

[10]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[11]  Dan Klein,et al.  Evaluating strategies for similarity search on the web , 2002, WWW '02.

[12]  Andrei Z. Broder,et al.  Identifying and Filtering Near-Duplicate Documents , 2000, CPM.

[13]  Edith Cohen,et al.  Tighter estimation using bottom k sketches , 2008, Proc. VLDB Endow..

[14]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[15]  Graham Cormode,et al.  What's new: finding significant differences in network data streams , 2004, INFOCOM 2004.

[16]  Ketan Mulmuley Randomized geometric algorithms and pseudo-random generators , 1992, Proceedings., 33rd Annual Symposium on Foundations of Computer Science.

[17]  Edith Cohen,et al.  Summarizing data using bottom-k sketches , 2007, PODC '07.

[18]  Abhinandan Das,et al.  Google news personalization: scalable online collaborative filtering , 2007, WWW '07.

[19]  Ely Porat,et al.  Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems , 2009, SPIRE.

[20]  Piotr Indyk,et al.  A small approximately min-wise independent family of hash functions , 1999, SODA '99.

[21]  Rajeev Rastogi,et al.  Processing set expressions over continuous update streams , 2003, SIGMOD '03.

[22]  Ely Porat,et al.  Exponential time improvement for min-wise based algorithms , 2011, SODA '11.