Practical minimal perfect hash functions for large databases

We describe the first practical algorithms for finding minimal perfect hash functions that have been used to access very large databases (i.e., having over 1 million keys). This method extends earlier work wherein an O(n^3) algorithm was devised, building upon prior work by Sager that described an O(n^4) algorithm. Our first linear expected time algorithm makes use of three key insights: applying randomness wherever possible, ordering our search for hash functions based on the degree of the vertices in a graph that represents word dependencies, and viewing hash value assignment in terms of adding circular patterns of related words to a partially filled disk. Our second algorithm builds functions that are slightly more complex, but does not build a word dependency graph and so approaches the theoretical lower bound on function specification size. While ultimately applicable to a wide variety of data and file access needs, these algorithms have already proven useful in aiding our work in improving the performance of CD-ROM systems and our construction of a Large External Network Database (LEND) for semantic networks and hypertext/hypermedia collections. Virginia Disc One includes a demonstration of a minimal perfect hash function running on a PC to access a 130,198 word list on that CD-ROM. Several other microcomputer, minicomputer, and parallel processor versions and applications of our algorithm have also been developed. Tests, including those with a French word list of 420,878 entries and a library catalog key set with over 3.8 million keys, have shown that our methods work with very large databases.
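To make the notion of a minimal perfect hash function concrete, the following is a minimal sketch in Python. It is not the algorithm of this paper: it swaps in a simple bucket-and-seed search (often called hash-and-displace) and builds no word dependency graph. It only echoes two of the ideas named above, namely using pseudo-random hash values and handling the most constrained groups of keys first. The names `_h`, `build_mphf`, `lookup`, the choice of BLAKE2b as the underlying hash, and the bucket ratio of one bucket per four keys are all assumptions made for illustration.

```python
import hashlib


def _h(key: str, seed: int, mod: int) -> int:
    """Seeded pseudo-random hash of key into range(mod) (illustrative helper)."""
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") % mod


def build_mphf(keys, max_seed=1_000_000):
    """Search for per-bucket seeds that map n distinct keys onto 0..n-1.

    Assumption: this is a generic bucket-and-seed search, not the
    graph-based method described in the abstract.
    """
    keys = list(keys)
    n = len(keys)
    if n == 0 or len(set(keys)) != n:
        raise ValueError("keys must be non-empty and distinct")

    r = max(1, n // 4)  # number of buckets (assumed ratio)
    buckets = [[] for _ in range(r)]
    for k in keys:
        buckets[_h(k, 0, r)].append(k)  # seed 0 is reserved for bucketing

    seeds = [0] * r
    occupied = [False] * n
    # Place the largest (most constrained) buckets first.
    for b in sorted(range(r), key=lambda i: -len(buckets[i])):
        bucket = buckets[b]
        if not bucket:
            continue
        for seed in range(1, max_seed):
            slots = [_h(k, seed, n) for k in bucket]
            no_self_clash = len(set(slots)) == len(slots)
            if no_self_clash and not any(occupied[s] for s in slots):
                for s in slots:
                    occupied[s] = True
                seeds[b] = seed
                break
        else:
            raise RuntimeError("no seed found; try more buckets")
    return seeds, n


def lookup(key: str, seeds, n: int) -> int:
    """Evaluate the hash function: one bucket hash, then one seeded hash."""
    return _h(key, seeds[_h(key, 0, len(seeds))], n)
```

A call such as `seeds, n = build_mphf(word_list)` followed by `lookup(word, seeds, n)` returns a distinct slot in 0..n-1 for every word in the list; keys outside the list also map to some slot, so a membership check against the stored record is still needed. The `seeds` list plays the role of the function specification whose size the abstract discusses.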

[1]  Edward A. Fox,et al.  Implementation of a Perfect Hash Function Scheme , 1989 .

[2]  Chin-Chen Chang Letter-oriented reciprocal hashing scheme , 1986, Inf. Sci..

[3]  Renzo Sprugnoli,et al.  Perfect hashing functions , 1977, Commun. ACM.

[4]  Edward A. Fox,et al.  Building a Large Thesaurus for Information Retrieval , 1988, ANLP.

[5]  Nick Cercone,et al.  Minimal and almost minimal perfect hash function search with application to natural language lexicon design , 1983 .

[6]  C. C. Chang The study of an ordered minimal perfect hashing scheme , 1984, CACM.

[7]  Edward A. Fox,et al.  A more cost effective algorithm for finding perfect hash functions , 1989, CSC '89.

[8]  E. Palmer Graphical evolution: an introduction to the theory of random graphs , 1985 .

[9]  S. K. Park,et al.  Random number generators: good ones are hard to find , 1988, CACM.

[10]  Alan L. Tharp,et al.  Near‐perfect hashing of large word sets , 1989, Softw. Pract. Exp..

[11]  Harry G. Mairson The program complexity of searching a table , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[12]  Amjad M. Daoud Efficient data structures for information retrieval , 1993 .

[13]  M. V. Ramakrishna,et al.  File organization using composite perfect hashing , 1989, ACM Trans. Database Syst..

[14]  Donald E. Knuth,et al.  The art of computer programming: sorting and searching (volume 3) , 1973 .

[15]  Jeanette P. Schmidt,et al.  On aspects of universality and performance for closed hashing , 1989, STOC '89.

[16]  Gaston H. Gonnet,et al.  External hashing with limited internal storage , 1988 .

[17]  Edward A. Fox,et al.  Order preserving minimal perfect hash functions and information retrieval , 1989, SIGIR '90.

[18]  János Komlós,et al.  Storing a sparse table with O(1) worst case access time , 1982, 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982).

[19]  Richard J. Cichelli Minimal perfect hash functions made simple , 1980, CACM.

[20]  Peter K. Pearson,et al.  Fast hashing of variable-length text strings , 1990, CACM.

[21]  Thomas J. Sager A polynomial time generator for minimal perfect hash functions , 1985, CACM.

[22]  Kurt Mehlhorn,et al.  On the program size of perfect and universal hash functions , 1982, 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982).

[23]  William Feller An Introduction to Probability Theory and Its Applications , 1950 .

[24]  Edward A. Fox,et al.  An O(n log n) Algorithm for Finding Minimal Perfect Hash Functions , 1989 .

[25]  Gerhard Jaeschke Reciprocal hashing: a method for generating minimal perfect hashing functions , 1981, CACM.

[26]  Larry Carter,et al.  Universal Classes of Hash Functions , 1979, J. Comput. Syst. Sci..

[27]  Richard J. Enbody,et al.  Dynamic hashing schemes , 1988, CSUR.

[28]  Collins Dictionaries Collins English Dictionary , 1991 .

[29]  R. Nigel Horspool,et al.  Practical Perfect Hashing , 1985, Comput. J..

[30]  Edward A. Fox,et al.  Building the CODER Lexicon: The Collins English Dictionary and Its Adverb Definitions , 1986 .

[31]  Edward A. Fox Optical disks and CD-ROM: publishing and access , 1988 .