Paper Information - Even Better Framework for min-wise Based Algorithms

Even Better Framework for min-wise Based Algorithms

In a recent paper from SODA '11 \cite{kminwise}, the authors introduced a general framework for exponential time improvement of min-wise based algorithms by defining and constructing an almost $k$-min-wise independent family of hash functions. Here we take it a step further and reduce the space and the degree of independence needed to represent the functions, by defining and constructing a $d$-$k$-min-wise independent family of hash functions. Surprisingly, in most cases only 8-wise independence is needed for exponential time and space improvement. Moreover, we bypass the $O(\log{\frac{1}{\epsilon}})$ independence lower bound for approximately min-wise functions \cite{patrascu10kwise-lb}, as we use an alternative definition. In addition, as the degree of independence is a small constant, the functions can be implemented efficiently. Informally, under this definition, all subsets of size $d$ of any fixed set $X$ have an equal probability of having their hash values among the minimal $k$ values in $X$, where the probability is over the random choice of hash function from the family. This property measures the randomness of the family, as a truly random function obviously satisfies the definition for $d=k=|X|$. We define and give a time- and space-efficient construction of an approximately $d$-$k$-min-wise independent family of hash functions. The degree of independence required is optimal, i.e. only $O(d)$ for $2 \le d < k=O(\frac{d}{\epsilon^2})$, where $\epsilon \in (0,1)$ is the desired error bound. This construction can be used to improve many min-wise based algorithms, such as \cite{sizeEstimationFramework,Datar02estimatingrarity,NearDuplicate,SimilaritySearch,DBLP:conf/podc/CohenK07}, as will be discussed here. To our knowledge, such definitions for hash functions have never been studied, and no construction has been given before.
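The informal definition in the abstract can be checked by brute force on a tiny example. The sketch below is not the paper's construction: it models a truly random hash function as a uniformly random ranking of the items of $X$ and computes, exactly, the probability that a fixed subset of size $d$ lands entirely among the $k$ smallest values. The function name and the parameters `n`, `k`, `d` are illustrative assumptions. The comparison value $\binom{k}{d}/\binom{|X|}{d}$ is the standard combinatorial probability for a uniformly random ranking; it is not quoted from the source.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb


def prob_subset_in_bottom_k(n, k, subset):
    """Exact probability, over a uniformly random ranking of n items,
    that every item in `subset` has a rank among the k smallest."""
    hits = 0
    total = 0
    # ranks[i] plays the role of the hash rank of item i under an ideal
    # (truly random) hash function; all n! rankings are equally likely.
    for ranks in permutations(range(n)):
        total += 1
        if all(ranks[i] < k for i in subset):
            hits += 1
    return Fraction(hits, total)


# Hypothetical small parameters: |X| = 6, k = 3, d = 2.
n, k, d = 6, 3, 2
probs = {s: prob_subset_in_bottom_k(n, k, s)
         for s in combinations(range(n), d)}
expected = Fraction(comb(k, d), comb(n, d))

# Every d-subset has the same probability, as the definition requires.
print(set(probs.values()) == {expected})
```

Under this idealized model every subset of size $d$ gets the same probability, which is the property the abstract describes. An approximate family would only need to match this value up to the error bound $\epsilon$.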

Ely Porat | Guy Feigenblat | Ariel Shiftan

[1] Gurmeet Singh Manku, et al. Detecting near-duplicates for web crawling, 2007, WWW '07.

[2] Grace Hui Yang, et al. Near-duplicate detection by instance-level constrained clustering, 2006, SIGIR.

[3] Alan M. Frieze, et al. Min-wise independent permutations (extended abstract), 1998, STOC '98.

[4] Monika Henzinger, et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms, 2006, SIGIR.

[5] Ely Porat, et al. Sketching Techniques for Collaborative Filtering, 2009, IJCAI.

[6] Aravind Srinivasan, et al. Low Discrepancy Sets Yield Approximate Min-Wise Independent Permutation Families, 1999, RANDOM-APPROX.

[7] S. Muthukrishnan, et al. Estimating Rarity and Similarity over Data Stream Windows, 2002, ESA.

[8] Edith Cohen, et al. Size-Estimation Framework with Applications to Transitive Closure and Reachability, 1997, J. Comput. Syst. Sci.

[9] Srikanta Tirthapura, et al. Estimating simple functions on the union of data streams, 2001, SPAA '01.

[10] Geoffrey Zweig, et al. Syntactic Clustering of the Web, 1997, Comput. Networks.

[11] Dan Klein, et al. Evaluating strategies for similarity search on the web, 2002, WWW '02.

[12] Andrei Z. Broder, et al. Identifying and Filtering Near-Duplicate Documents, 2000, CPM.

[13] Edith Cohen, et al. Tighter estimation using bottom k sketches, 2008, Proc. VLDB Endow.

[14] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[15] Graham Cormode, et al. What's new: finding significant differences in network data streams, 2004, INFOCOM 2004.

[16] Ketan Mulmuley. Randomized geometric algorithms and pseudo-random generators, 1992, Proceedings, 33rd Annual Symposium on Foundations of Computer Science.

[17] Edith Cohen, et al. Summarizing data using bottom-k sketches, 2007, PODC '07.

[18] Abhinandan Das, et al. Google news personalization: scalable online collaborative filtering, 2007, WWW '07.

[19] Ely Porat, et al. Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems, 2009, SPIRE.

[20] Piotr Indyk, et al. A small approximately min-wise independent family of hash functions, 1999, SODA '99.

[21] Rajeev Rastogi, et al. Processing set expressions over continuous update streams, 2003, SIGMOD '03.

[22] Ely Porat, et al. Exponential time improvement for min-wise based algorithms, 2011, SODA '11.