Min-wise independent permutations (extended abstract)

∗ Digital SRC, 130 Lytton Avenue, Palo Alto, CA 94301, USA. E-mail: broder@pa.dec.com.

† Computer Science Department, Stanford University, CA 94305, USA. E-mail: moses@cs.stanford.edu. Part of this work was done while this author was a summer intern at Digital SRC. Supported by the Pierre and Christine Lamond Fellowship and in part by an ARO MURI Grant DAAH04-96-1-0007 and NSF Award CCR-9357849, with matching funds from IBM, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.

‡ Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA. Part of this work was done while this author was visiting Digital SRC. Supported in part by NSF grant CCR9530974. E-mail: af1p@andrew.cmu.edu.

§ Digital SRC, 130 Lytton Avenue, Palo Alto, CA 94301, USA. E-mail: michaelm@pa.dec.com.

We define and study the notion of min-wise independent families of permutations. We say that F ⊆ S_n is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words, we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept; we present the solution to some of them and we list the rest as open problems.
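The definition can be checked by brute force for very small n. The following is a minimal sketch, not part of the paper: the function name `is_min_wise_independent`, the use of the ground set {0, ..., n-1} in place of [n], and the two example families are all assumptions made for illustration. It counts, for every nonempty X and every x ∈ X, how many permutations in the family make π(x) the minimum of π(X), and compares that count with |F|/|X|.

```python
from fractions import Fraction
from itertools import combinations, permutations


def is_min_wise_independent(family, n):
    """Check Pr(min{pi(X)} = pi(x)) = 1/|X| for every nonempty X and every x in X.

    `family` is a collection of permutations of range(n), each given as a
    tuple pi where pi[i] is the image of i. Illustrative helper only.
    """
    family = list(family)
    for size in range(1, n + 1):
        for X in combinations(range(n), size):
            # Count how often each element of X attains the minimum image.
            wins = {x: 0 for x in X}
            for pi in family:
                winner = min(X, key=lambda x: pi[x])
                wins[winner] += 1
            # Under a uniform choice from the family, each element must win
            # exactly |family| / |X| times.
            target = Fraction(len(family), size)
            if any(count != target for count in wins.values()):
                return False
    return True


n = 4

# Example family 1 (assumed for illustration): the full symmetric group.
symmetric_group = list(permutations(range(n)))

# Example family 2 (assumed for illustration): the n cyclic shifts i -> (i + s) mod n.
cyclic_shifts = [tuple((i + s) % n for i in range(n)) for s in range(n)]

print(is_min_wise_independent(symmetric_group, n))
print(is_min_wise_independent(cyclic_shifts, n))
```

The full symmetric group passes by symmetry. The cyclic shifts fail already on X = {0, 1}, where element 0 attains the minimum under three of the four shifts. The check enumerates all 2^n − 1 nonempty subsets, so it is only practical for tiny n.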
