Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index.

Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6-108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.

[1]  G. McVean,et al.  De novo assembly and genotyping of variants using colored de Bruijn graphs , 2011, Nature Genetics.

[2]  Páll Melsted,et al.  Efficient counting of k-mers in DNA sequences using a bloom filter , 2011, BMC Bioinformatics.

[3]  A. Biegert,et al.  HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment , 2011, Nature Methods.

[4]  Michael A. Bender,et al.  A General-Purpose Counting Filter: Making Every Bit Count , 2017, SIGMOD Conference.

[5]  Chao Xie,et al.  Fast and sensitive protein alignment using DIAMOND , 2014, Nature Methods.

[6]  Rasko Leinonen,et al.  The sequence read archive: explosive growth of sequencing data , 2011, Nucleic Acids Res..

[7]  Michael A. Bender,et al.  Don't Thrash: How to Cache Your Hash on Flash , 2011, Proc. VLDB Endow..

[8]  Arend Hintze,et al.  Scaling metagenome sequence assembly with probabilistic de Bruijn graphs , 2011, Proceedings of the National Academy of Sciences.

[9]  Michael A. Bender,et al.  Squeakr: An Exact and Approximate k-mer Counting System , 2017, bioRxiv.

[10]  Chen Sun,et al.  AllSome Sequence Bloom Trees , 2016, bioRxiv.

[11]  Rajeev Raman,et al.  Succinct indexable dictionaries with applications to encoding k-ary trees and multisets , 2002, SODA '02.

[12]  Johannes Söding,et al.  MMseqs2: sensitive protein sequence searching for the analysis of massive data sets , 2017, bioRxiv.

[13]  Michael A. Bender,et al.  deBGR: an efficient and near-exact representation of the weighted de Bruijn graph , 2017, Bioinform..

[14]  Lenore Cowen,et al.  Compressive genomics for protein databases , 2013, Bioinform..

[15]  Marco Previtali,et al.  Fully Dynamic de Bruijn Graphs , 2016, SPIRE.

[16]  Hamid Mohamadi,et al.  ntCard: a streaming algorithm for cardinality estimation in genomics data , 2017, Bioinform..

[17]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[18]  Prashant Pandey,et al.  Rainbowfish: A Succinct Colored de Bruijn Graph Representation , 2017 .

[19]  Carl Kingsford,et al.  Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees , 2017, RECOMB.

[20]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[21]  Carl Kingsford,et al.  Fast Search of Thousands of Short-Read Sequencing Experiments , 2015, Nature Biotechnology.

[22]  Jens Stoye,et al.  Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage , 2016, Algorithms for Molecular Biology.

[23]  Carl Kingsford,et al.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers , 2011, Bioinform..

[24]  Christina Boucher,et al.  Succinct Colored de Bruijn Graphs , 2016, bioRxiv.

[25]  Patrick E. O'Neil,et al.  The log-structured merge-tree (LSM-tree) , 1996, Acta Informatica.