Paper information - Estimating coverage in metagenomic data sets and why it matters

A ‘metagenome’ is the theoretical collection of genomes from all members of a given microbial community, and a ‘metagenomic data set’ is the subset of it captured in a given sequencing event. Although these terms are often used interchangeably, and metagenomic data sets are regularly called metagenomes by synecdoche, their relationship is analogous to that between sample and population in statistics. The fraction of the metagenome represented in the metagenomic data set, termed coverage (not to be confused with sequencing depth, the number of times a feature is repeatedly sampled), is of key importance in assessing the statistical significance of the features sampled (taxa, genes and so on). However, quantitative computational methods to assess the level of coverage are limited, a problem we have recently attempted to solve. In extreme cases, where small data sets are used to characterize complex communities, misleading inferences can arise. For instance, random variation can frequently be mistaken for real differences when comparing metagenomic data sets with extreme differences in coverage. Insufficient coverage also reduces the detection limits and statistical power of such comparisons, hiding real, ecologically relevant trends and differences (Figure 1). We demonstrate here how available solutions can determine the level of sequencing coverage obtained by metagenomic data sets and thus guide their robust analysis and comparison.
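The distinction between coverage and sequencing depth can be illustrated with a toy calculation. The sketch below is not the method referred to in the abstract. It assumes that reads land uniformly at random on each genome, so that a genome sequenced at depth d has an expected covered fraction of about 1 - exp(-d) (the classic Lander-Waterman expectation, ignoring edge effects), and it assumes that community coverage is the abundance-weighted mean of those per-genome fractions. The function name, the community composition, and all numeric values are hypothetical.

```python
import math


def expected_coverage(abundances, genome_lengths, n_reads, read_length):
    """Abundance-weighted expected coverage of a toy community.

    abundances: share of community DNA contributed by each genome
                (normalized internally, so any positive weights work)
    genome_lengths: genome sizes in base pairs, same order as abundances
    n_reads, read_length: size of the hypothetical sequencing effort
    """
    total = sum(abundances)
    coverage = 0.0
    for weight, length in zip(abundances, genome_lengths):
        p = weight / total
        # Sequencing depth of this genome: bases sequenced / genome size.
        depth = n_reads * p * read_length / length
        # Expected fraction of the genome hit by at least one read
        # (assumption: uniform random read placement).
        covered = 1.0 - math.exp(-depth)
        coverage += p * covered
    return coverage


# Hypothetical community: one dominant genome holding half of the DNA,
# plus 99 equally rare genomes, all 4 Mbp long.
abundances = [0.5] + [0.5 / 99] * 99
genome_lengths = [4_000_000] * 100

small = expected_coverage(abundances, genome_lengths, 100_000, 150)
large = expected_coverage(abundances, genome_lengths, 10_000_000, 150)
print(f"small data set: {small:.2f}, large data set: {large:.2f}")
```

In this toy setting the dominant genome is sampled at appreciable depth even in the small data set, yet overall coverage stays low because the rare genomes are barely sampled. That is the situation in which, per the abstract, comparisons between data sets are most likely to mislead.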

Luis M. Rodriguez-R | Konstantinos T. Konstantinidis
