MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale

The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet making all this sequencing data searchable and easily accessible to life science and data science researchers remains an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offers different, practically relevant trade-offs between the space taken by the index and its query performance. MetaGraph provides a flexible methodological framework that allows index construction to scale from consumer laptops to distributed cloud compute clusters, processing terabases to petabases of input data. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework’s scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI’s Sequence Read Archive, representing a total input of more than three petabases. Besides demonstrating the utility of MetaGraph indexes in key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including indexes of over 450,000 microbial WGS records, more than 110,000 fungal WGS records, and more than 20,000 whole metagenome sequencing records. A subset of these indexes is available online for interactive queries. All indexes created from public data, comprising in total more than 1 million records, are available for download or use in the cloud.
As an example of our indexes’ integrative analysis capabilities, we introduce the concept of differential assembly, which extracts sequences that are present in a foreground set of samples but absent from a given background set. We apply this technique to assemble contigs that identify pathogenic agents transmitted via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and used them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA editing in GTEx and TCGA.
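The differential assembly idea described above can be illustrated with a minimal sketch. This is not MetaGraph's implementation or API: it assumes that sequences are compared through their k-mers, ignores reverse complements and k-mer abundances, and uses hypothetical helper names (`kmer_set`, `assemble_contigs`, `differential_assembly`). It keeps the k-mers seen in the foreground samples but not in the background samples, then joins them into contigs along non-branching paths.

```python
from typing import Iterable, List, Set

ALPHABET = "ACGT"


def kmer_set(samples: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer occurring in any of the given sequences."""
    return {s[i:i + k] for s in samples for i in range(len(s) - k + 1)}


def successors(kmer: str, pool: Set[str]) -> List[str]:
    """k-mers in the pool that overlap `kmer` by k-1 characters on the right."""
    return [kmer[1:] + c for c in ALPHABET if kmer[1:] + c in pool]


def predecessors(kmer: str, pool: Set[str]) -> List[str]:
    """k-mers in the pool that overlap `kmer` by k-1 characters on the left."""
    return [c + kmer[:-1] for c in ALPHABET if c + kmer[:-1] in pool]


def assemble_contigs(pool: Set[str]) -> List[str]:
    """Merge k-mers into contigs along non-branching paths (unitigs)."""

    def is_start(kmer: str) -> bool:
        preds = predecessors(kmer, pool)
        return len(preds) != 1 or len(successors(preds[0], pool)) != 1

    visited: Set[str] = set()
    contigs: List[str] = []
    ordered = sorted(pool)  # sorted only to make the output deterministic
    starts = [km for km in ordered if is_start(km)]
    # Path starts come first; the second pass picks up isolated cycles.
    for kmer in starts + ordered:
        if kmer in visited:
            continue
        visited.add(kmer)
        contig, current = kmer, kmer
        while True:
            succ = successors(current, pool)
            if len(succ) != 1:
                break
            nxt = succ[0]
            if nxt in visited or len(predecessors(nxt, pool)) != 1:
                break
            contig += nxt[-1]
            visited.add(nxt)
            current = nxt
        contigs.append(contig)
    return contigs


def differential_assembly(foreground: Iterable[str],
                          background: Iterable[str],
                          k: int) -> List[str]:
    """Contigs built from k-mers present in foreground but absent in background."""
    unique = kmer_set(foreground, k) - kmer_set(background, k)
    return assemble_contigs(unique)


if __name__ == "__main__":
    print(differential_assembly(["AAACCCGGGT"], ["AAACCC"], k=4))
```

In this toy input, the background shares the prefix `AAACCC` with the foreground, so only the k-mers covering the remaining part of the foreground sequence survive and are joined into a single contig. A real index would answer the same set-difference question over many labelled samples without materializing the k-mer sets in memory.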

[1]  Li Ding,et al.  Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. , 2018, Cell systems.

[2]  Leonid Oliker,et al.  Extreme Scale De Novo Metagenome Assembly , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Thomas R. Gingeras,et al.  STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[4]  O. Gotoh An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[5]  Joshua M. Stuart,et al.  The Cancer Genome Atlas Pan-Cancer analysis project , 2013, Nature Genetics.

[6]  Pierre Peterlongo,et al.  Read mapping on de Bruijn graphs , 2015, BMC Bioinformatics.

[7]  Sebastian Deorowicz,et al.  KMC 3: counting and manipulating k‐mer statistics , 2017, Bioinform..

[8]  Jeff Daily,et al.  Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments , 2016, BMC Bioinformatics.

[9]  Prashant Pandey,et al.  Rainbowfish: A Succinct Colored de Bruijn Graph Representation , 2017, bioRxiv.

[10]  Phelim Bradley,et al.  COBS: a Compact Bit-Sliced Signature Index , 2019, SPIRE.

[11]  Veli Mäkinen,et al.  Bit-parallel sequence-to-graph alignment , 2019, Bioinform..

[12]  Wen J. Li,et al.  Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation , 2015, Nucleic Acids Res..

[13]  M. Schatz,et al.  Big Data: Astronomical or Genomical? , 2015, PLoS biology.

[14]  Benedict Paten,et al.  Distance indexing and seed clustering in sequence graphs , 2019, bioRxiv.

[15]  Lukas Wagner,et al.  A Greedy Algorithm for Aligning DNA Sequences , 2000, J. Comput. Biol..

[16]  P. Pevzner,et al.  metaSPAdes: a new versatile metagenomic assembler. , 2017, Genome research.

[17]  G. McVean,et al.  De novo assembly and genotyping of variants using colored de Bruijn graphs , 2011, Nature Genetics.

[18]  Patrick Mäder,et al.  Recommending plant taxa for supporting on-site species identification , 2018, BMC Bioinformatics.

[19]  Brian D. Ondov,et al.  Mash: fast genome and metagenome distance estimation using MinHash , 2015, Genome Biology.

[20]  Christina Boucher,et al.  Succinct Dynamic de Bruijn Graphs , 2020, bioRxiv.

[21]  Leping Li,et al.  ART: a next-generation sequencing read simulator , 2012, Bioinform..

[22]  Phelim Bradley,et al.  Ultra-fast search of all deposited bacterial and viral genomic data , 2019, Nature Biotechnology.

[23]  Chirag Jain,et al.  Accelerating Sequence Alignment to Graphs , 2019, bioRxiv.

[24]  Christina Boucher,et al.  Succinct Colored de Bruijn Graphs , 2016, bioRxiv.

[25]  Benedict Paten,et al.  A graph extension of the positional Burrows–Wheeler transform and its applications , 2016, Algorithms for Molecular Biology.

[26]  H. Tettelin,et al.  The microbial pan-genome. , 2005, Current opinion in genetics & development.

[27]  Kunihiko Sadakane,et al.  Succinct de Bruijn Graphs , 2012, WABI.

[28]  Glenn Hickey,et al.  Genotyping structural variants in pangenome graphs using the vg toolkit , 2019, Genome Biology.

[29]  Prashant Pandey,et al.  An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search , 2018, bioRxiv.

[30]  Rasko Leinonen,et al.  The sequence read archive: explosive growth of sequencing data , 2011, Nucleic Acids Res..

[31]  Ellen T. Gelfand,et al.  The Genotype-Tissue Expression (GTEx) project , 2013, Nature Genetics.

[32]  Gunnar Rätsch,et al.  AStarix: Fast and Optimal Sequence-to-Graph Alignment , 2020, RECOMB.

[33]  Kunihiko Sadakane,et al.  MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph , 2014, Bioinform..

[34]  Carl Kingsford,et al.  Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees , 2016, bioRxiv.

[35]  Tobias Marschall,et al.  GraphAligner: rapid and versatile sequence-to-graph alignment , 2019, Genome Biology.

[36]  Michael Huber,et al.  Metagenomic Virome Sequencing in Living Donor and Recipient Kidney Transplant Pairs Revealed JC Polyomavirus Transmission , 2018, Clinical infectious diseases : an official publication of the Infectious Diseases Society of America.

[37]  Sergey I. Nikolenko,et al.  SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing , 2012, J. Comput. Biol..

[38]  Christina Boucher,et al.  Metagenome SNP calling via read-colored de Bruijn graphs , 2020, Bioinform..

[39]  Gunnar Rätsch,et al.  Global Genetic Cartography of Urban Metagenomes and Anti-Microbial Resistance , 2019, bioRxiv.

[40]  Luke Zappia,et al.  Opportunities and challenges in long-read sequencing data analysis , 2020, Genome Biology.

[41]  Gil McVean,et al.  Integrating long-range connectivity information into de Bruijn graphs , 2017, bioRxiv.

[42]  Tobias Marschall,et al.  GraphAligner: rapid and versatile sequence-to-graph alignment , 2020, Genome biology.

[43]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[44]  Leen Stougie,et al.  Strain-aware assembly of genomes from mixed samples using flow variation graphs , 2019 .

[45]  Thomas C. Conway,et al.  Succinct data structures for assembling large genomes , 2010, Bioinform..

[46]  Richard Durbin,et al.  Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT) , 2014, Bioinform..

[47]  Kiyoshi Asai,et al.  PBSIM: PacBio reads simulator - toward accurate genome assembly , 2013, Bioinform..

[48]  Gunnar Rätsch,et al.  Sparse Binary Relation Representations for Genome Graph Annotation , 2018, bioRxiv.

[49]  Geoffrey L. Winsor,et al.  CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database , 2019, Nucleic Acids Res..

[50]  Robert D. Finn,et al.  A unified catalog of 204,938 reference genomes from the human gut microbiome , 2020, Nature Biotechnology.

[51]  Chun-Nan Hsu,et al.  Weakly supervised learning of biomedical information extraction from curated data , 2016, BMC Bioinformatics.

[52]  Gunnar Rätsch,et al.  AStarix: Fast and Optimal Sequence-to-Graph Alignment , 2020, bioRxiv.

[53]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[54]  Ryan L. Collins,et al.  The mutational constraint spectrum quantified from variation in 141,456 humans , 2020, Nature.

[55]  Jens Stoye,et al.  Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage , 2016, Algorithms for Molecular Biology.

[56]  Prashant Pandey,et al.  An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search , 2019, RECOMB.

[57]  Daniel N. Baker,et al.  KrakenUniq: confident and fast metagenomics classification using unique k-mer counts , 2018, Genome Biology.

[58]  David A. Fitzpatrick,et al.  Pan-genome analyses of model fungal species , 2019, Microbial genomics.

[59]  William Jones,et al.  Variation graph toolkit improves read mapping by representing genetic variation in the reference , 2018, Nature Biotechnology.

[60]  Sergey Koren,et al.  Mash Screen: high-throughput sequence containment estimation for genome discovery , 2019, Genome Biology.

[61]  Irina M. Armean,et al.  The mutational constraint spectrum quantified from variation in 141,456 humans , 2019, Nature.

[62]  Paul Medvedev,et al.  Improved Representation of Sequence Bloom Trees , 2018, bioRxiv.

[63]  Yadong Wang,et al.  deBGA: read alignment with de Bruijn graph-based seed and extension , 2016, Bioinform..

[64]  Yves Van de Peer,et al.  BrownieAligner: accurate alignment of Illumina sequencing data to de Bruijn graphs , 2018, BMC Bioinform..

[65]  Leonid Oliker,et al.  HipMer: an extreme-scale de novo genome assembler , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.