DREAM-Yara: An implementation of an exact read mapper for very large databases

Motivation: Mapping-based approaches have become limited in their application to very large sets of references, since computing an FM-index for very large databases (e.g. > 10 GB) has become a bottleneck. This affects many analyses that need such an index as an essential step for approximate matching of NGS reads to reference databases. For instance, in metagenomics, the reference sequences have grown so large that computing a single full-text index on standard machines is prohibitive. Even on large-memory machines, computing the index takes about one day of compute time. As a result, indices are rarely updated. Hence, it is desirable to distribute the indices to solve the index construction and update problem while preserving fast search times.

Results: To solve the index construction and update problem, we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The first main contribution is the introduction of binning directories based on a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads from those parts of the database where they cannot match. This allows us to keep the databases in several indices, which can easily be rebuilt if parts are updated, while maintaining a fast search time. The second main contribution is the integration of the Yara exact read mapper (Siragusa, 2013) in a distributed version for the DREAM framework.
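
To make the idea of an interleaved Bloom filter more concrete, the following is a minimal Python sketch and not the DREAM-Yara implementation. It assumes that each database part (bin) is summarized by the k-mers it contains, that the bits of all per-bin Bloom filters are interleaved so that one lookup returns one bit per bin, and that a read is excluded from a bin when too few of its k-mers are found there (the k-mer counting lemma). The class name, method names, parameters, and the choice of hash function are all hypothetical; the abstract does not specify them.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter (hypothetical names and parameters).

    rows[p] is an integer whose bit b is set if bin b has set position p,
    so a single lookup at position p yields one bit for every bin.
    """

    def __init__(self, num_bins, bits_per_bin, num_hashes=3, k=4):
        self.num_bins = num_bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        self.k = k
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        # Derive num_hashes positions from salted BLAKE2b digests (an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits_per_bin

    def _kmers(self, sequence):
        return (sequence[i:i + self.k] for i in range(len(sequence) - self.k + 1))

    def insert(self, bin_id, sequence):
        """Add all k-mers of a reference sequence to the given bin."""
        for kmer in self._kmers(sequence):
            for p in self._positions(kmer):
                self.rows[p] |= 1 << bin_id

    def bins_containing(self, kmer):
        """Return a bit mask of bins that may contain the k-mer."""
        mask = (1 << self.num_bins) - 1
        for p in self._positions(kmer):
            mask &= self.rows[p]
        return mask

    def candidate_bins(self, read, max_errors):
        """Return bins in which the read could match with at most max_errors errors."""
        # k-mer lemma: each error destroys at most k of the read's k-mers.
        threshold = len(read) - self.k + 1 - max_errors * self.k
        if threshold <= 0:
            # The lemma gives no guarantee, so no bin can be excluded.
            return list(range(self.num_bins))
        counts = [0] * self.num_bins
        for kmer in self._kmers(read):
            mask = self.bins_containing(kmer)
            for b in range(self.num_bins):
                if (mask >> b) & 1:
                    counts[b] += 1
        return [b for b in range(self.num_bins) if counts[b] >= threshold]


ibf = InterleavedBloomFilter(num_bins=3, bits_per_bin=1024)
ibf.insert(0, "ACGTACGTTAGC")
ibf.insert(1, "TTTTGGGGCCCCAAAA")
ibf.insert(2, "GATTACAGATTACA")
print(ibf.candidate_bins("ACGTACGT", max_errors=0))
```

In this sketch a read is then mapped only against the indices of the returned bins. Bloom filters can report false positives but never false negatives, so a bin that really contains the read is never excluded, while a bin that does not may occasionally be searched unnecessarily.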

[1] Jouni Sirén, et al. Compressed Suffix Arrays for Massive Data, 2009, SPIRE.

[2] Wen J. Li, et al. RefSeq: an update on prokaryotic genome annotation and curation, 2017, Nucleic Acids Res.

[3] Scott Federhen, et al. The NCBI Taxonomy database, 2011, Nucleic Acids Res.

[4] Knut Reinert, et al. The SeqAn C++ template library for efficient sequence analysis: A resource for programmers, 2017, Journal of biotechnology.

[5] Giovanni Manzini, et al. Opportunistic data structures with applications, 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[6] Bernhard Y. Renard, et al. SLIMM: species level identification of microorganisms from metagenomes, 2017, PeerJ.

[7] Burton H. Bloom, et al. Space/time trade-offs in hash coding with allowable errors, 1970, CACM.

[8] Alistair Moffat, et al. From Theory to Practice: Plug and Play with Succinct Data Structures, 2013, SEA.

[9] Knut Reinert, et al. Lambda: the local aligner for massive biological data, 2014, Bioinform.

[10] Enrico Siragusa, et al. Approximate string matching for high-throughput sequencing, 2015.

[11] Steven L Salzberg, et al. Fast gapped-read alignment with Bowtie 2, 2012, Nature Methods.

[12] Bernhard Y. Renard, et al. DUDes: a top-down taxonomic profiler for metagenomics, 2016, Bioinform.

[13] Phelim Bradley, et al. Real-time search of all bacterial and viral genomic data, 2017, bioRxiv.

[14] Mauro Leoncini, et al. Approximation algorithms for a hierarchically structured bin packing problem, 2004, Inf. Process. Lett.

[15] Gregory Kucherov, et al. RNF: a general framework to evaluate NGS read mappers, 2015, Bioinform.

[16] Manuel Holtgrewe, et al. Mason – A Read Simulator for Second Generation Sequencing Data, 2010.

[17] C. Quince, et al. Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities, 2013, Environmental microbiology.

[18] Giovanna Rosone, et al. Lightweight BWT Construction for Very Large String Collections, 2011, CPM.

[19] Ying Zhang, et al. Computational pan-genomics: status, promises and challenges, 2016, bioRxiv.

[20] Knut Reinert, et al. RazerS 3: Faster, fully sensitive read mapping, 2012, Bioinform.

[21] N. Warthmann, et al. Simultaneous alignment of short reads against multiple genomes, 2009, Genome Biology.

[22] Knut Reinert, et al. Journaled string tree - a scalable data structure for analyzing thousands of similar genomes on your laptop, 2014, Bioinform.