Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens’ theorem

Metagenomic project design has relied variously upon speculation, semi-empirical and ad hoc heuristic models, and elementary extensions of single-sample Lander–Waterman expectation theory, all of which are demonstrably inadequate. Here, we propose an approach based upon a generalization of Stevens’ theorem for randomly covering a domain. We extend this result to account for the presence of multiple species and derive from it useful probabilities for fully recovering a particular target microbe of interest and for average contig length. These show improved specificities compared to older measures and recommend deeper data generation than the levels chosen by some early studies, supporting the view that poor assemblies were due at least in part to insufficient data. We assess predictions empirically by generating roughly 4.5 Gb of sequence from a twelve-member bacterial community and comparing coverage for two particular members, Selenomonas artemidis and Enterococcus faecium, which are the least (~3%) and most (~12%) abundant species, respectively. Agreement is reasonable, with differences likely attributable to coverage biases. We show that, in some cases, bias is simple in the sense that a small reduction in read length, which simulates less efficient covering, brings data and theory into essentially complete accord. Finally, we describe two applications of the theory. One plots coverage probability over the relevant parameter space, constructing what is essentially a “metagenomic design map” to enable straightforward analysis and design of future projects. The other gives an overview of the data requirements for various types of sequencing milestones, including a desired number of contact reads and contig length, for detection of a rare viral species.
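
As a rough illustration of the kind of calculation involved, the sketch below evaluates the classical Stevens formula for the probability that n randomly placed arcs, each a fraction a of a circle, cover the circle completely. It then mixes that probability over a binomial number of reads landing on a target species. The binomial mixing step, the circular-genome assumption, and the names `stevens_full_coverage` and `target_full_coverage` are illustrative assumptions made here; they are not the generalized result derived in the paper.

```python
from fractions import Fraction
from math import comb


def stevens_full_coverage(n, a):
    """Classical Stevens probability that n random arcs, each covering a
    fraction a of a unit circle, leave no gap.

    Exact rational arithmetic is used because the alternating sum
    cancels badly in floating point.
    """
    if n <= 0:
        return 0.0
    a = Fraction(a)
    if a >= 1:
        return 1.0
    k_max = min(n, int(1 / a))  # terms with 1 - k*a < 0 are dropped
    total = sum(
        (-1) ** k * comb(n, k) * (1 - k * a) ** (n - 1)
        for k in range(k_max + 1)
    )
    return float(total)


def target_full_coverage(total_reads, abundance, a):
    """Illustrative assumption: the number of reads hitting the target
    species is Binomial(total_reads, abundance), and each such read covers
    a fraction a of a circular target genome."""
    prob = 0.0
    for n in range(total_reads + 1):
        weight = (
            comb(total_reads, n)
            * abundance ** n
            * (1.0 - abundance) ** (total_reads - n)
        )
        prob += weight * stevens_full_coverage(n, a)
    return prob


if __name__ == "__main__":
    # Toy numbers only: 60 reads in total, each spanning 20% of the target.
    for abundance in (0.03, 0.12):
        print(abundance, target_full_coverage(60, abundance, Fraction(1, 5)))
```

Because exact fractions are used, this sketch is only practical for small toy inputs. Realistic read counts would need a numerically stable approximation, which is outside what this example attempts.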
