Comprehensive Benchmarking and Ensemble Approaches for Metagenomic Classifiers

Background: One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole genome shotgun sequencing data, comprehensive comparisons of these methods are limited. In this study, we use the largest set to date (n = 35) of laboratory-generated and simulated controls, spanning 846 species, to evaluate the performance of eleven metagenomic classifiers. We also assess the effects of filtering and combining tools to reduce the number of false positives.

Results: Tools were characterized on the basis of their ability to (1) identify taxa at the genus, species, and strain levels, (2) quantify the relative abundance of taxa, and (3) classify individual reads to the species level. Strikingly, the number of species identified by the eleven tools can differ by over three orders of magnitude on the same datasets. However, various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Indeed, combining tools that rely on different heuristics improves precision. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples. Such false positives are especially important when they concern medically relevant species, a setting in which customized tools may be required.

Conclusions: The results of this study provide positive controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision and recall. We show that proper experimental design and analysis parameters, including depth of sequencing, choice of classifier or classifiers, database size, and filtering, can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
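The strategies named above (abundance filtering, ensemble voting, and tool intersection) and the precision and recall used to compare them can be illustrated with a minimal Python sketch. This is not the study's implementation: the function names, the abundance threshold, the vote cutoff, and the toy species profiles are all hypothetical and chosen only to show the idea on a sample whose true composition is known.

```python
from collections import Counter


def precision_recall(predicted, truth):
    """Precision and recall of a predicted species set against a known truth set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def abundance_filter(profile, min_abundance):
    """Keep only taxa whose relative abundance reaches the (assumed) threshold."""
    return {taxon for taxon, abundance in profile.items() if abundance >= min_abundance}


def vote_ensemble(calls_per_tool, min_tools):
    """Keep taxa reported by at least `min_tools` tools.

    Setting `min_tools` to the number of tools gives the strict intersection.
    """
    votes = Counter(taxon for calls in calls_per_tool for taxon in set(calls))
    return {taxon for taxon, n in votes.items() if n >= min_tools}


# Hypothetical toy data: a known community and three made-up tool outputs
# (species -> relative abundance). None of these numbers come from the study.
truth = {"E. coli", "S. aureus", "B. subtilis"}
tool_a = {"E. coli": 0.50, "S. aureus": 0.30, "B. subtilis": 0.19, "Y. pestis": 0.01}
tool_b = {"E. coli": 0.45, "S. aureus": 0.35, "B. anthracis": 0.20}
tool_c = {"E. coli": 0.60, "B. subtilis": 0.40}

raw_a = precision_recall(tool_a, truth)
filtered_a = precision_recall(abundance_filter(tool_a, 0.05), truth)
majority = precision_recall(vote_ensemble([tool_a, tool_b, tool_c], 2), truth)
intersection = precision_recall(vote_ensemble([tool_a, tool_b, tool_c], 3), truth)
```

In this toy case, filtering and majority voting both remove the spurious calls, while the strict intersection keeps precision high at the cost of recall, which is the trade-off the abstract describes when it compares ranges of precision and recall.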
