Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Abstract Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.

[1]  Xiaolong Wu,et al.  BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads , 2014, Bioinform..

[2]  Robert Patro,et al.  Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms , 2013, ArXiv.

[3]  Arend Hintze,et al.  Scaling metagenome sequence assembly with probabilistic de Bruijn graphs , 2011, Proceedings of the National Academy of Sciences.

[4]  B. Langmead,et al.  Lighter: fast and memory-efficient sequencing error correction without counting , 2014, Genome Biology.

[5]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[6]  Carl Kingsford,et al.  Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees , 2015, bioRxiv.

[7]  Gregory Kucherov,et al.  Using cascading Bloom filters to improve the memory usage for de Brujin graphs , 2013, Algorithms for Molecular Biology.

[8]  Eran Halperin,et al.  Fast lossless compression via cascading Bloom filters , 2014, BMC Bioinformatics.

[9]  Derrick E. Wood,et al.  Kraken: ultrafast metagenomic sequence classification using exact alignments , 2014, Genome Biology.

[10]  Bryan O'Sullivan,et al.  Using Bloom Filters for Large Scale Gene Sequence Analysis in Haskell , 2009, PADL.

[11]  Carl Kingsford,et al.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers , 2011, Bioinform..

[12]  Björn Andersson,et al.  Classification of DNA sequences using Bloom filters , 2010, Bioinform..

[13]  Andrei Broder,et al.  Network Applications of Bloom Filters: A Survey , 2004, Internet Math..

[14]  Bonnie Berger,et al.  Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification , 2014, RECOMB.

[15]  Darya Filippova,et al.  Improving Bloom Filter Performance on Sequence Data Using k -mer Bloom Filters , 2016, RECOMB.

[16]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[17]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[18]  Weiguo Liu,et al.  Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[19]  Carl Kingsford,et al.  Fast Search of Thousands of Short-Read Sequencing Experiments , 2015, Nature Biotechnology.

[20]  Dominique Lavenier,et al.  Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph , 2015, BMC Bioinformatics.

[21]  Jens Stoye,et al.  Bloom Filter Trie - A Data Structure for Pan-Genome Storage , 2015, WABI.

[22]  Rayan Chikhi,et al.  Space-efficient and exact de Bruijn graph representation based on a Bloom filter , 2012, Algorithms for Molecular Biology.