GUNC: detection of chimerism and contamination in prokaryotic genomes

Genomes are critical units in microbiology, yet ascertaining quality in prokaryotic genome assemblies remains a formidable challenge. We present GUNC (the Genome UNClutterer), a tool that accurately detects and quantifies genome chimerism based on the lineage homogeneity of individual contigs using a genome’s full complement of genes. GUNC complements existing approaches by targeting previously underdetected types of contamination: we conservatively estimate that 5.7% of genomes in GenBank, 5.2% in RefSeq, and 15–30% of pre-filtered “high-quality” metagenome-assembled genomes in recent studies are undetected chimeras. GUNC provides a fast and robust tool to substantially improve prokaryotic genome quality.
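
As a rough illustration of what "lineage homogeneity of individual contigs" could mean in practice, the sketch below scores how strongly the taxonomic labels of a genome's genes segregate by contig, using a normalized conditional entropy (an uncertainty coefficient). This is an assumption-laden toy and not necessarily the score GUNC computes: the function names, the input format (one `(contig_id, taxon)` pair per gene), and the exact normalization are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def contig_lineage_score(genes):
    """Hypothetical chimerism score in [0, 1].

    genes: list of (contig_id, taxon) pairs, one per gene.
    Returns (H(T) - H(T|C)) / H(T), where T is the taxon label and C the
    contig. Values near 1 mean each contig is internally homogeneous while
    contigs differ from one another; values near 0 mean any taxonomic
    disagreement is spread evenly across contigs, or there is none.
    """
    taxa = [taxon for _, taxon in genes]
    h_taxa = entropy(taxa)
    if h_taxa == 0:
        # A single lineage (or no genes): nothing to separate.
        return 0.0

    by_contig = {}
    for contig, taxon in genes:
        by_contig.setdefault(contig, []).append(taxon)

    n = len(genes)
    # Conditional entropy H(T|C): per-contig entropy weighted by contig size.
    h_cond = sum(len(v) / n * entropy(v) for v in by_contig.values())
    return (h_taxa - h_cond) / h_taxa


# Two contigs, each internally consistent but from different lineages.
chimeric = [("c1", "A")] * 4 + [("c2", "B")] * 4
# Same label mix, but scattered evenly over both contigs.
scattered = [("c1", "A"), ("c1", "B"), ("c2", "A"), ("c2", "B")] * 2
# Every gene agrees on one lineage.
clean = [("c1", "A")] * 4 + [("c2", "A")] * 4

chimeric_score = contig_lineage_score(chimeric)
scattered_score = contig_lineage_score(scattered)
clean_score = contig_lineage_score(clean)
```

In this toy setup the contig-wise split scores highest, while evenly scattered disagreement and a fully consistent genome both score zero. A real tool would also need gene calling, taxonomic assignment against a reference database, and handling of unclassified genes, none of which is modeled here.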
