Vector Quotient Filters: Overcoming the Time/Space Trade-Off in Filter Design

Today's filters, such as quotient, cuckoo, and Morton filters, trade space against speed: even when moderately full (e.g., 50%-75% full), their performance degrades nontrivially. As a result, today's systems designers are forced to choose between speed and space usage. In this paper, we present the vector quotient filter (VQF). Locally, the VQF is based on Robin Hood hashing, like the quotient filter, but uses power-of-two-choices hashing to reduce the variance of runs, and thus offers consistent, high throughput across load factors. Power-of-two-choices hashing also makes it more amenable to concurrent updates than the cuckoo filter and its variants. Finally, the vector quotient filter is designed to exploit SIMD instructions so that all operations have O(1) cost, independent of the size of the filter or its load factor. We show that the vector quotient filter is 2× faster for inserts than the Morton filter (a cuckoo filter variant and the state of the art for inserts) and has lookup and deletion performance similar to that of the cuckoo filter (which is fastest for queries and deletes), despite having a simpler design and implementation. The vector quotient filter has minimal performance decline at high load factors, a problem that has plagued modern filters, including quotient, cuckoo, and Morton filters. Furthermore, we give a thread-safe version of the vector quotient filter and show that insertion throughput scales 3× with four threads compared to a single thread.
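
The variance-reduction effect of power-of-two-choices hashing mentioned above can be illustrated with a small simulation. This is a minimal sketch, not the VQF implementation: seeded pseudo-random bucket indices stand in for hash functions, and the function name `max_bucket_load` and the bucket and item counts are assumptions chosen only for illustration. Each item picks either one candidate bucket or two, and with two candidates it goes to the less loaded one.

```python
import random


def max_bucket_load(num_buckets, num_items, choices, seed=0):
    """Place items into buckets and return the fullest bucket's load.

    Each item draws `choices` candidate buckets (pseudo-random indices
    standing in for independent hash functions) and is placed in the
    candidate that currently holds the fewest items.
    """
    rng = random.Random(seed)
    loads = [0] * num_buckets
    for _ in range(num_items):
        candidates = [rng.randrange(num_buckets) for _ in range(choices)]
        # Pick the least loaded candidate; ties go to the first one drawn.
        target = min(candidates, key=loads.__getitem__)
        loads[target] += 1
    return max(loads)


# Hypothetical sizes: as many items as buckets.
one = max_bucket_load(10_000, 10_000, choices=1)
two = max_bucket_load(10_000, 10_000, choices=2)
print("max load with one choice: ", one)
print("max load with two choices:", two)
```

With two candidates per item, the fullest bucket is typically noticeably less loaded than with a single candidate, which is the kind of bounded occupancy that lets fixed-size blocks be filled to high load factors without frequent overflow.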
