REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

Motivation: In this work we present REINDEER, a novel computational method that indexes sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets.

Results: We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of 4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. It then constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances.

Availability: https://github.com/kamimrcht/REINDEER

Contact: camille.marchet@univ-lille.fr
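To make the idea of recording k-mer abundances across datasets concrete, here is a minimal toy sketch in Python. It is not REINDEER's implementation: it does not build compacted de Bruijn graphs or real monotigs, and it keeps everything in plain dictionaries. The function names (`count_kmers`, `build_index`, `group_by_profile`, `query`) are hypothetical and chosen only for illustration. Grouping k-mers by identical count vectors is a deliberate simplification of "groups of k-mers of similar abundances".

```python
from collections import Counter, defaultdict

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = "".join(_COMPLEMENT[base] for base in reversed(kmer))
    return min(kmer, revcomp)


def count_kmers(reads, k):
    """Count canonical k-mers in one dataset (a list of read strings)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def build_index(datasets, k):
    """Map each canonical k-mer to a tuple of its counts, one entry per dataset."""
    per_dataset = [count_kmers(reads, k) for reads in datasets]
    all_kmers = set().union(*per_dataset)
    return {
        kmer: tuple(counts.get(kmer, 0) for counts in per_dataset)
        for kmer in all_kmers
    }


def group_by_profile(index):
    """Toy stand-in for monotigs: group k-mers sharing the same count vector."""
    groups = defaultdict(list)
    for kmer, profile in index.items():
        groups[profile].append(kmer)
    return {profile: sorted(kmers) for profile, kmers in groups.items()}


def query(index, sequence, k, n_datasets):
    """Return the per-dataset count vector of every k-mer in a query sequence.

    A k-mer absent from the index gets an all-zero vector, which also
    answers the presence/absence question for each dataset.
    """
    absent = (0,) * n_datasets
    return [
        index.get(canonical(sequence[i:i + k]), absent)
        for i in range(len(sequence) - k + 1)
    ]
```

In this sketch, storing one count vector per group rather than per k-mer is what hints at the space saving: k-mers with the same abundance profile share a single record. A real index at the scale reported above needs far more compact structures than Python dictionaries.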
