Perfect hashing for external files

Perfect hashing refers to hashing with no overflows. The use of perfect hashing functions has previously been studied only for small static sets stored in main memory. All known systematic methods of finding perfect hashing functions are conceptually complicated, and most methods need exponential time to determine a perfect hashing function for a given set of keys. In this thesis we propose and analyze a perfect hashing scheme for large external files. The scheme guarantees retrieval of any record in a single disk access. Insertions and deletions are simple, and the file size may vary considerably without adversely affecting the performance. A variant of the basic scheme gives a completely dynamic file organization which also supports efficient range searching. These advantages are achieved at the cost of a small amount of supplemental internal storage and increased cost of insertions.

An ordinary hashing function is used to divide the records of the file into a number of groups. The records in each group are then hashed over a number of contiguous pages of external memory by a perfect hashing function. A perfect hashing function for a group can be found by repeated random selection from a suitable class of hashing functions. We analyze the probability of a randomly chosen function (from the set of all functions) being perfect. We then describe a policy that limits the cost of finding perfect hashing functions. The resulting tradeoff between the storage utilization and the cost of finding perfect hashing functions is investigated. Results of experiments with a simple and practical class of hashing functions are reported. They indicate that the relative frequency of perfect hashing functions within the class is statistically the same as predicted by the theoretical analysis for the set of all functions. The performance of the new scheme is also compared with other hashing schemes.

We conclude that the proposed perfect hashing scheme is a practical and competitive technique for organizing external files.
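
The two-level organization described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and every name and parameter in it is an assumption made for illustration: integer keys, division hashing (`key % num_groups`) as the ordinary hashing function, the linear family `((a*key + c) mod P) mod num_pages` as the class searched at random, and a plain cap `max_trials` standing in for the cost-limiting policy, whose actual form the abstract does not give.

```python
import random
from collections import Counter

P = 2_147_483_647  # prime 2^31 - 1; keys are assumed to be smaller than P


def group_of(key, num_groups):
    # Stand-in "ordinary" hashing function: division hashing (assumption).
    return key % num_groups


def page_of(key, params, num_pages):
    # One member of the assumed linear class, selected by params = (a, c).
    a, c = params
    return ((a * key + c) % P) % num_pages


def is_perfect(keys, params, num_pages, capacity):
    # Perfect means no overflow: no page receives more than `capacity` records.
    load = Counter(page_of(k, params, num_pages) for k in keys)
    return all(n <= capacity for n in load.values())


def find_perfect(keys, num_pages, capacity, max_trials, rng):
    # Repeated random selection from the class, stopped after max_trials.
    for _ in range(max_trials):
        params = (rng.randrange(1, P), rng.randrange(P))
        if is_perfect(keys, params, num_pages, capacity):
            return params
    return None


def build(keys, num_groups, num_pages, capacity, max_trials=1000, seed=0):
    # Returns one (a, c) pair per group; this table is the supplemental
    # internal storage that the lookup needs.
    rng = random.Random(seed)
    groups = [[] for _ in range(num_groups)]
    for k in keys:
        groups[group_of(k, num_groups)].append(k)
    table = []
    for members in groups:
        params = find_perfect(members, num_pages, capacity, max_trials, rng)
        if params is None:
            raise RuntimeError("no perfect function found within max_trials")
        table.append(params)
    return table


def locate(key, table, num_pages):
    # Page address computed entirely in memory, so a retrieval would read
    # exactly one page of the external file.
    g = group_of(key, len(table))
    return g * num_pages + page_of(key, table[g], num_pages)


if __name__ == "__main__":
    keys = list(range(1000, 1200))
    table = build(keys, num_groups=20, num_pages=8, capacity=4)
    print(locate(1042, table, num_pages=8))
```

In this sketch, giving each group more pages or a larger page capacity makes a perfect function easier to find but lowers storage utilization, which is the tradeoff the abstract refers to. An insertion that overflows a page would require a new search for that group only, which is one way to read the increased insertion cost mentioned above.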