String hashing for linear probing

Linear probing is one of the most popular implementations of dynamic hash tables storing all keys in a single array. When we get a key, we first hash it to a location. Next we probe consecutive locations until the key or an empty location is found. At STOC'07, Pagh et al. presented data sets where the standard implementation of 2-universal hashing leads to an expected number of Ω(log n) probes. They also showed that with 5-universal hashing, the expected number of probes is constant. Unfortunately, we do not have 5-universal hashing for, say, variable length strings. When we want to do such complex hashing from a complex domain, the generic standard solution is that we first do collision free hashing (w.h.p.) into a simpler intermediate domain, and second do the complicated hash function on this intermediate domain. Our contribution is that for an expected constant number of linear probes, it is suffices that each key has O(1) expected collisions with the first hash function, as long as the second hash function is 5-universal. This means that the intermediate domain can be n times smaller, and such a smaller intermediate domain typically means that the overall hash function can be made simpler and at least twice as fast. The same doubling of hashing speed for O(1) expected probes follows for most domains bigger than 32-bit integers, e.g., 64-bit integers and fixed length strings. In addition, we study how the overhead from linear probing diminishes as the array gets larger, and what happens if strings are stored directly as intervals of the array. These cases were not considered by Pagh et al.

[1]  Donald E. Knuth,et al.  The Art of Computer Programming: Volume 3: Sorting and Searching , 1998 .

[2]  Larry Carter,et al.  Universal Classes of Hash Functions , 1979, J. Comput. Syst. Sci..

[3]  Larry Carter,et al.  New Hash Functions and Their Use in Authentication and Set Equality , 1981, J. Comput. Syst. Sci..

[4]  Bjarne Stroustrup,et al.  C++ Programming Language , 1986, IEEE Softw..

[5]  Jeanette P. Schmidt,et al.  The analysis of closed hashing under limited randomness , 1990, STOC '90.

[6]  Bjarne Stroustrup,et al.  The C++ programming language (2nd ed.) , 1991 .

[7]  Yossi Matias,et al.  Polynomial Hash Functions Are Reliable (Extended Abstract) , 1992, ICALP.

[8]  A. Siegel,et al.  Closed Hashing is Computable and Optimally Randomizable with Universal Hash Functions , 1995 .

[9]  Martin Dietzfelbinger,et al.  Universal Hashing and k-Wise Independent Random Variables via Integer Arithmetic without Primes , 1996, STACS.

[10]  Martti Penttonen,et al.  A Reliable Randomized Algorithm for the Closest-Pair Problem , 1997, J. Algorithms.

[11]  Charles U. Martel,et al.  Graph and Hashing Algorithms for Modern Architectures: Design and Performance , 1998, WAE.

[12]  Hugo Krawczyk,et al.  UMAC: Fast and Secure Message Authentication , 1999, CRYPTO.

[13]  Bjarne Stroustrup,et al.  The C++ Programming Language: Special Edition , 2000 .

[14]  David Thomas,et al.  The Art in Computer Programming , 2001 .

[15]  Mikkel Thorup,et al.  Tabulation based 4-universal hashing with applications to second moment estimation , 2004, SODA '04.

[16]  Rasmus Pagh,et al.  Cuckoo Hashing , 2001, Encyclopedia of Algorithms.

[17]  Gregory L. Heileman,et al.  How Caching Affects Hashing , 2005, ALENEX/ANALCO.

[18]  Anna Pagh,et al.  Linear probing with constant independence , 2006, STOC '07.

[19]  Kai-Min Chung,et al.  Why simple hash functions work: exploiting the entropy in a data stream , 2008, SODA '08.