Exploring neighborhoods in large metagenome assembly graphs reveals hidden sequence diversity

Genomes computationally inferred from large metagenomic data sets are often incomplete and may be missing functionally important content and strain variation. We introduce an information retrieval system for large metagenomic data sets that exploits the sparsity of DNA assembly graphs to efficiently extract subgraphs surrounding an inferred genome. We apply this system to recover missing content from genome bins and show that substantial genomic sequence variation is present in a real metagenome. Our software implementation is available at https://github.com/spacegraphcats/spacegraphcats under the 3-Clause BSD License.
