A minimal perfect function maps a static set of n keys on to the range of integers {0,1,2,...,n - 1}. We present a scalable high performance algorithm based on random graphs for constructing minimal perfect hash functions (MPHFs). For a set of n keys, our algorithm outputs a description of h in expected time O(n). The evaluation of h(x) requires three memory accesses for any key x and the description of h takes up 0.89n bytes (7.13n bits). This is the best (most space efficient) known result to date. Using a simple heuristic and Huffman coding, the space requirement is further reduced to 0.79n bytes (6.86n bits). We present a high performance architecture that is easy to parallelize and scales well to very large data sets encountered in internet search applications. Experimental results on a one billion URL dataset obtained from Live Search crawl data, show that the proposed algorithm (a)finds an MPHF for one billion URLs in less than 4 minutes, and (b) requires only 6.86 bits/key for the description of h.
[1]
Donald E. Knuth,et al.
The Art of Computer Programming: Volume 3: Sorting and Searching
,
1998
.
[2]
Donald E. Knuth,et al.
The art of computer programming: sorting and searching (volume 3)
,
1973
.
[3]
George Havas,et al.
An Optimal Algorithm for Generating Minimal Perfect Hash Functions
,
1992,
Inf. Process. Lett..
[4]
Yoshiharu Kohayakawa,et al.
A Practical Minimal Perfect Hashing Method
,
2005,
WEA.
[5]
Donald Ervin Knuth,et al.
The Art of Computer Programming
,
1968
.