Fast and scalable minimal perfect hashing for massive key sets

Minimal perfect hash functions provide space-efficient, collision-free hashing on static sets. Existing algorithms and implementations that build such functions are limited in practice in the number of input elements they can process, because of high construction time or high RAM or external-memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It can build a minimal perfect hash function for 10^{10} elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^{12}. Source code: https://github.com/rizkg/BBHash
