A new method for rapid genome classification, clustering, visualization, and novel taxa discovery from metagenome

Current supervised, phylogeny-based methods fall short in recognizing species assembled from metagenomic datasets from under-investigated habitats, because the assembled genomes are often incomplete or lack closely related known relatives. Here, we report an efficient software suite, “Genome Constellation”, that estimates similarities between genomes based on their k-mer matches and subsequently uses these similarities for classification, clustering, and visualization. The clusters of reference genomes formed by Genome Constellation closely resemble known phylogenetic relationships while simultaneously revealing unexpected connections. In a dataset of 1,693 draft genomes assembled from Antarctic lake communities, of which only 40% could be placed in a phylogenetic tree, Genome Constellation improves taxa assignment to 61%. It also revealed six clusters derived from new bacterial phyla and 63 new giant viruses, 3 of which were missed by the traditional marker-based approach. In summary, we demonstrate that Genome Constellation can tackle the computational and algorithmic challenges in large-scale taxonomy analyses in metagenomics.
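
To make the idea of k-mer-based genome similarity concrete, the following minimal sketch compares toy sequences by the overlap of their k-mer sets and assigns a query to its most similar reference. It is an illustration under stated assumptions, not the Genome Constellation implementation: the Jaccard index, the value k = 4, and the function names `kmer_set`, `jaccard`, and `classify` are all hypothetical choices made here, and the abstract does not specify which similarity measure the software uses. Reverse complements and k-mer subsampling, which a practical tool would likely need, are omitted for brevity.

```python
def kmer_set(sequence: str, k: int = 4) -> set[str]:
    """Return the set of all k-mers (length-k substrings) in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index: shared k-mers divided by all distinct k-mers.

    Assumed measure for illustration only; other set-similarity
    measures could be substituted here.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(query: str, references: dict[str, str], k: int = 4) -> tuple[str, float]:
    """Assign the query to the reference with the highest k-mer similarity."""
    query_kmers = kmer_set(query, k)
    scores = {
        name: jaccard(query_kmers, kmer_set(seq, k))
        for name, seq in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy sequences, far shorter than real genomes.
    references = {"ref_a": "ACGTACGTAA", "ref_b": "TTTTGGGGCC"}
    print(classify("ACGTACGT", references))
```

The same pairwise scores could be collected into a similarity matrix and passed to any standard clustering or graph-layout routine, which is the general pattern the abstract describes for clustering and visualization.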
