Spectrum preserving tilings enable sparse and modular reference indexing

The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences ℛ so that the positions of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of ℛ that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in ℛ. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recent work on constructing and compactly indexing k-mer sets can be used to efficiently implement the k-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominate those of the k-mer-to-tile mapping, since the former depends on the total amount of sequence while the latter depends on the number of unique k-mers in ℛ. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-occurrence mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6× slowdown when querying k-mers from a sequenced readset.

Supplementary materials: Sections S.1 to S.8 are available online at https://doi.org/10.5281/zenodo.7504717

Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.
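As an illustration of the two-step decomposition described in the abstract, the following is a minimal Python sketch of a modular index over a toy tiling. It is not the pufferfish2 implementation or its API. The function names (`build_toy_index`, `locate`), the dictionary-based layout, and the toy data are all hypothetical. The sketch also assumes that each k-mer occurs in exactly one tile at one offset, and that only the forward strand is indexed.

```python
from collections import defaultdict

K = 3  # toy k-mer length (assumption for this example)

# Hypothetical toy tiling: tiles overlap by K - 1 characters in the references.
TILES = {"t0": "ACGT", "t1": "GTTA"}
# For each reference, the tiles that spell it out and where each tile starts.
TILING = {
    "ref1": [("t0", 0), ("t1", 2)],  # spells "ACGTTA"
    "ref2": [("t1", 0)],             # spells "GTTA"
}


def build_toy_index(tiles, tiling, k):
    """Build the two mappings of a modular index over a tiling."""
    # (1) k-mer-to-tile mapping: k-mer -> (tile id, offset inside the tile).
    kmer_to_tile = {}
    for tile_id, seq in tiles.items():
        for off in range(len(seq) - k + 1):
            kmer_to_tile[seq[off:off + k]] = (tile_id, off)

    # (2) tile-to-occurrence mapping: tile id -> list of (reference, position).
    tile_to_occ = defaultdict(list)
    for ref, occurrences in tiling.items():
        for tile_id, pos in occurrences:
            tile_to_occ[tile_id].append((ref, pos))
    return kmer_to_tile, tile_to_occ


def locate(kmer, kmer_to_tile, tile_to_occ):
    """Return every (reference, position) at which the k-mer occurs."""
    hit = kmer_to_tile.get(kmer)
    if hit is None:
        return []  # k-mer is absent from the indexed references
    tile_id, off = hit
    # Shift each tile occurrence by the k-mer's offset within the tile.
    return sorted((ref, pos + off) for ref, pos in tile_to_occ[tile_id])


KMER_TO_TILE, TILE_TO_OCC = build_toy_index(TILES, TILING, K)

if __name__ == "__main__":
    for query in ("GTT", "CGT", "AAA"):
        print(query, locate(query, KMER_TO_TILE, TILE_TO_OCC))
```

In this toy layout, the first mapping has one entry per distinct k-mer, while the second has one entry per tile occurrence across all references. That is why, as the abstract notes, the second mapping grows with the total amount of sequence.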
