Sub-linear Sequence Search via a Repeated And Merged Bloom Filter (RAMBO)

Whole-genome shotgun sequencing (WGS), especially of microbial genomes, has been at the core of recent research advances in large-scale comparative genomics. The resulting data deluge has driven exponential growth in genomic datasets over the past years and shows no sign of slowing down. Several recent methods attempt to tame the computational burden of read classification and sequence search on these ultra-large-scale datasets, which include both raw reads and assembled genomes. A notable recent method is BigSI, which is built around Bloom filters and offers very efficient sequence search times. However, querying with BigSI still requires probing Bloom filters (or sets of bitslices), and the number of probes scales linearly with the number of datasets. As a result, scaling BigSI up to collections with potentially millions of samples or more is likely prohibitive. In this paper, we propose RAMBO (Repeated And Merged Bloom Filter), in which the number of Bloom filter probes per query scales sub-linearly with the number of datasets and is therefore significantly smaller than in BigSI at the same false-positive rate. Our idea is theoretically sound and is inspired by the count-min sketch, a data structure widely used in streaming algorithms. When evaluated on real genome datasets, RAMBO provides a significant improvement in query time over BigSI. Furthermore, because of the sub-linear scaling, the gains of RAMBO over BigSI grow with the size and number of datasets.
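The abstract does not spell out the construction, so the following Python sketch is only one plausible reading of "repeated and merged" in the spirit of the count-min sketch. It assumes that each of R repetitions hashes every dataset into one of B buckets, that each bucket holds a single Bloom filter over the merged k-mers of its datasets, and that a query intersects the candidate datasets across repetitions. All names here (`BloomFilter`, `RamboSketch`, `kmers`, the parameters, and the toy genomes) are hypothetical and not taken from the paper or from BigSI.

```python
import hashlib


def _hash(value: str, seed: int, modulus: int) -> int:
    """Deterministic seeded hash of a string into range(modulus)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def kmers(sequence: str, k: int) -> set:
    """All length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def add(self, item: str) -> None:
        for seed in range(self.num_hashes):
            self.bits[_hash(item, seed, self.num_bits)] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[_hash(item, seed, self.num_bits)]
                   for seed in range(self.num_hashes))


class RamboSketch:
    """Assumed layout: R repetitions x B buckets, one merged filter per bucket."""

    def __init__(self, num_buckets: int, num_repetitions: int,
                 num_bits: int = 4096, num_hashes: int = 3):
        self.num_buckets = num_buckets
        self.num_repetitions = num_repetitions
        self.filters = [[BloomFilter(num_bits, num_hashes)
                         for _ in range(num_buckets)]
                        for _ in range(num_repetitions)]
        # Which datasets were merged into each bucket of each repetition.
        self.members = [[set() for _ in range(num_buckets)]
                        for _ in range(num_repetitions)]

    def insert(self, dataset_id: str, items) -> None:
        items = list(items)  # allow generators; reused once per repetition
        for r in range(self.num_repetitions):
            # Offset the seed so bucket assignment differs per repetition.
            b = _hash(dataset_id, 1000 + r, self.num_buckets)
            self.members[r][b].add(dataset_id)
            for item in items:
                self.filters[r][b].add(item)

    def query(self, item: str) -> set:
        """Candidate datasets containing item (false positives possible)."""
        candidates = None
        for r in range(self.num_repetitions):
            hits = set()
            for b in range(self.num_buckets):
                if item in self.filters[r][b]:
                    hits |= self.members[r][b]
            candidates = hits if candidates is None else candidates & hits
            if not candidates:
                break
        return candidates or set()


# Toy data, invented purely for illustration.
datasets = {
    "genome_a": "ACGTACGTGA",
    "genome_b": "TTGACCGTAA",
    "genome_c": "GGCATTCAGG",
    "genome_d": "ACGTTTGACC",
}
sketch = RamboSketch(num_buckets=2, num_repetitions=3)
for name, sequence in datasets.items():
    sketch.insert(name, kmers(sequence, 4))

print(sorted(sketch.query("ACGT")))
```

In this sketch a query probes B × R filters rather than one filter per dataset. Like an ordinary Bloom filter it never misses a dataset that contains the queried k-mer, but it may report extra ones. Which choice of B and R yields the sub-linear probe count claimed above is not derived here.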
