Analysis of sequencing strategies and tools for taxonomic annotation: Defining standards for progressive metagenomics

Metagenomics research has recently thrived owing to improvements in DNA sequencing technologies, which have driven the emergence of new analysis tools and the growth of taxonomic databases. However, no all-purpose strategy guarantees the best result for a given project, and many combinations of software, parameters and databases can be tested. Therefore, we performed an impartial comparison of eight bioinformatic tools and four taxonomic databases using statistical measures of classification, defining a benchmark framework to evaluate each tool in a standardized context. Using in silico simulated data for 16S rRNA amplicons and whole metagenome shotgun sequencing, we compared the results from different software and database combinations to detect biases related to algorithms or database annotation. With our benchmark framework, researchers can define cut-off values to evaluate the expected error rate and coverage of their results, regardless of the score used by each software. A quick guide to selecting the best tool, together with all datasets and scripts needed to reproduce our results and benchmark any new method, is available at https://github.com/Ales-ibt/Metagenomic-benchmark. Finally, we stress the importance of gold standards, database curation and manual inspection of taxonomic profiling results for a better and more accurate description of microbial diversity.
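To illustrate how a score cut-off relates to error rate and coverage, the following is a minimal sketch in Python and not the benchmark's actual scripts. It assumes that each simulated read carries a classifier score and a flag saying whether the assigned taxon matches the known origin. The function names and the definitions used here are illustrative assumptions: coverage is the fraction of all reads classified at or above the cut-off, and error rate is the fraction of those accepted classifications that are wrong.

```python
from typing import List, Optional, Tuple

# Each prediction is (score, is_correct); both fields are hypothetical
# stand-ins for a tool's native score and the simulated ground truth.
Prediction = Tuple[float, bool]


def coverage_and_error_rate(predictions: List[Prediction],
                            cutoff: float) -> Tuple[float, float]:
    """Return (coverage, error_rate) for classifications scoring >= cutoff."""
    accepted = [ok for score, ok in predictions if score >= cutoff]
    total = len(predictions)
    coverage = len(accepted) / total if total else 0.0
    wrong = len(accepted) - sum(accepted)
    error_rate = wrong / len(accepted) if accepted else 0.0
    return coverage, error_rate


def loosest_cutoff(predictions: List[Prediction],
                   max_error: float) -> Optional[Tuple[float, float, float]]:
    """Scan observed scores from lowest to highest and return the first
    (cutoff, coverage, error_rate) whose error rate is within max_error."""
    for cutoff in sorted({score for score, _ in predictions}):
        coverage, error_rate = coverage_and_error_rate(predictions, cutoff)
        if error_rate <= max_error:
            return cutoff, coverage, error_rate
    return None


# Toy data for illustration only; not taken from the study.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
result = loosest_cutoff(toy, max_error=0.25)
```

Because only the ranking of scores matters in this sketch, the same procedure could in principle be applied to any tool's score (bit score, confidence, k-mer fraction) once it is paired with ground truth from simulated data.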
