Nonpareil 3: Fast Estimation of Metagenomic Coverage and Sequence Diversity

ABSTRACT Estimations of microbial community diversity based on metagenomic data sets are affected, often to an unknown degree, by biases derived from insufficient coverage and from reference database-dependent estimations of diversity. For instance, the completeness of reference databases cannot generally be estimated, since it depends on the extant diversity sampled to date, which, with the exception of a few habitats such as the human gut, remains severely undersampled. Further, estimating the degree of coverage of a microbial community by a metagenomic data set is prohibitively time-consuming for large data sets, and coverage values may not be directly comparable between data sets obtained with different sequencing technologies. Here, we extend Nonpareil, a database-independent tool for the estimation of coverage in metagenomic data sets, to a high-performance computing implementation that scales up to hundreds of cores and that additionally includes a k-mer-based estimation as sensitive as the original alignment-based version but about three hundred times as fast. Further, we propose a metric of sequence diversity (Nd), derived directly from Nonpareil curves, that correlates well with alpha diversity assessed by traditional metrics. We use this metric in different experiments to demonstrate its correlation with the Shannon index estimated on 16S rRNA gene profiles, and we show that Nd additionally reveals seasonal patterns in marine samples that are not captured by the Shannon index and provides more precise rankings of the magnitude of diversity of microbial communities in different habitats. Therefore, the new version of Nonpareil, called Nonpareil 3, advances the toolbox for metagenomic analyses of microbiomes.

IMPORTANCE Estimation of the coverage provided by a metagenomic data set, i.e., what fraction of the microbial community was sampled by DNA sequencing, represents an essential first step of every culture-independent genomic study that aims to robustly assess the sequence diversity present in a sample. However, estimation of coverage remains elusive because of several technical limitations associated with high computational requirements and limited statistical approaches to quantify diversity. Here, we describe Nonpareil 3, a new bioinformatics algorithm that circumvents several of these limitations and thus can facilitate culture-independent studies in clinical or environmental settings, independent of the sequencing platform employed. In addition, we present a new metric of sequence diversity based on rarefied coverage and demonstrate its use in communities from diverse ecosystems.
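As a point of reference for the comparison with the Shannon index mentioned in the abstract, that index can be computed from a profile of taxon counts as H' = -Σ p_i ln p_i, where p_i is the relative abundance of taxon i. The Python sketch below is a minimal illustration only: the function name and the example count vectors are hypothetical, they are not taken from the study or from Nonpareil, and the code does not compute Nd.

```python
import math
from typing import Iterable


def shannon_index(counts: Iterable[int]) -> float:
    """Return the Shannon index H' = -sum(p_i * ln(p_i)).

    `counts` holds per-taxon abundances (for example, reads per OTU in a
    16S rRNA gene profile). Taxa with a count of zero are ignored because
    they contribute nothing to the sum.
    """
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("at least one taxon must have a positive count")
    return -sum((c / total) * math.log(c / total) for c in positive)


# Hypothetical count vectors (illustrative values, not data from the study).
even_profile = [25, 25, 25, 25]    # four equally abundant taxa
skewed_profile = [97, 1, 1, 1]     # one dominant taxon

print(f"even profile:   H' = {shannon_index(even_profile):.3f}")
print(f"skewed profile: H' = {shannon_index(skewed_profile):.3f}")
```

For a perfectly even profile of S taxa the index equals ln(S), and it decreases as a single taxon comes to dominate. This is why the index serves as a summary of alpha diversity in amplicon-based surveys.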
