Min-Wise Independent Permutations

We define and study the notion of min-wise independent families of permutations. We say that $F \subseteq S_n$ (the symmetric group) is min-wise independent if for any set $X \subseteq [n]$ and any $x \in X$, when $\pi$ is chosen at random in $F$ we have $\Pr(\min\{\pi(X)\} = \pi(x)) = \frac{1}{|X|}$. In other words, we require that all the elements of any fixed set $X$ have an equal chance to become the minimum element of the image of $X$ under $\pi$. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept; we present the solutions to some of them and we list the rest as open problems.
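
The definition can be checked directly for very small n. The following is a minimal brute-force sketch, not a construction from the paper: the function name `is_min_wise_independent`, the 0-indexed ground set {0, ..., n-1}, and the two example families (all of S_4, and the four cyclic shifts) are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import combinations, permutations


def is_min_wise_independent(family, n):
    """Brute-force check of the definition over every nonempty X in {0..n-1}.

    Each permutation is a tuple pi where pi[i] is the image of element i.
    A permutation is drawn uniformly at random from `family`.
    """
    family = list(family)
    for size in range(1, n + 1):
        for X in combinations(range(n), size):
            for x in X:
                # Count permutations under which x attains the minimum of pi(X).
                hits = sum(1 for pi in family if min(pi[y] for y in X) == pi[x])
                if Fraction(hits, len(family)) != Fraction(1, size):
                    return False
    return True


n = 4
full_group = list(permutations(range(n)))  # all 24 permutations of S_4
cyclic_shifts = [tuple((i + s) % n for i in range(n)) for s in range(n)]

print(is_min_wise_independent(full_group, n))     # True
print(is_min_wise_independent(cyclic_shifts, n))  # False
```

The full symmetric group passes by symmetry. The cyclic shifts fail already on X = {0, 1}: element 0 is the minimum under three of the four shifts, so its probability is 3/4 rather than 1/2. The check enumerates all subsets and all permutations, so it is only practical for tiny n.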
