Estimating DNA coverage and abundance in metagenomes using a gamma approximation

Motivation: Shotgun sequencing generates large numbers of short DNA reads from either an isolated organism or, in the case of metagenomics projects, from the aggregate genome of a microbial community. These reads are then assembled based on overlapping sequences into larger, contiguous sequences (contigs). The feasibility of assembly and the coverage achieved (reads per nucleotide or distinct sequence of nucleotides) depend on several factors: the number of reads sequenced, the read length, and the relative abundances of their source genomes in the microbial community. A low coverage suggests that most of the genomic DNA in the sample has not been sequenced, but it is often difficult to estimate either the extent of the uncaptured diversity or the amount of additional sequencing that would be most efficacious. In this work, we regard a metagenome as a population of DNA fragments (bins), each of which may be covered by one or more reads. We employ a gamma distribution to model this bin population because of its flexibility and ease of use. When a gamma approximation can be found that adequately fits the data, we may estimate the number of bins that were not sequenced and that could potentially be revealed by additional sequencing. We evaluate the performance of this model using simulated metagenomes and demonstrate its applicability on three recent metagenomic datasets.

Contact: sean.d.hooper@genpat.uu.se

Supplementary information: Supplementary data are available at Bioinformatics online.
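
The idea of modelling bin coverage with a gamma distribution and then estimating the number of unsequenced bins can be illustrated with a minimal sketch. The abstract does not specify the fitting procedure, so everything below is an assumption: per-bin read counts are taken to be Poisson with a gamma-distributed rate (which gives a negative binomial), only bins with at least one read are observed, and the parameters are chosen by a crude grid search over a zero-truncated likelihood. The function names and the grid are hypothetical and are not taken from the paper.

```python
import math

def nb_log_pmf(x, shape, scale):
    """Log P(X = x) when X ~ Poisson(rate) and rate ~ Gamma(shape, scale)."""
    p = 1.0 / (1.0 + scale)  # negative binomial "success" probability
    return (math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
            + shape * math.log(p) + x * math.log(1.0 - p))

def zero_fraction(shape, scale):
    """P(X = 0): expected fraction of bins that receive no reads."""
    return (1.0 + scale) ** (-shape)

def truncated_log_likelihood(counts, shape, scale):
    """Log-likelihood of the observed counts, all of which are >= 1."""
    log_norm = math.log(1.0 - zero_fraction(shape, scale))
    return sum(nb_log_pmf(x, shape, scale) - log_norm for x in counts)

def fit_grid(counts, shapes, scales):
    """Pick the (shape, scale) pair on the grid with the highest likelihood."""
    best = max((truncated_log_likelihood(counts, k, s), k, s)
               for k in shapes for s in scales)
    return best[1], best[2]

def estimate_unseen(n_observed, shape, scale):
    """Expected number of bins with zero reads, given n_observed covered bins."""
    p0 = zero_fraction(shape, scale)
    return n_observed * p0 / (1.0 - p0)

if __name__ == "__main__":
    # Hypothetical reads-per-bin counts for the bins that were sequenced.
    counts = [1, 1, 1, 2, 1, 3, 1, 2, 1, 1, 4, 1, 2, 1, 1]
    grid = [0.25 * i for i in range(1, 21)]  # 0.25 .. 5.0
    shape, scale = fit_grid(counts, grid, grid)
    print(shape, scale, estimate_unseen(len(counts), shape, scale))
```

A grid search keeps the sketch dependency-free; a real analysis would more likely use a numerical optimiser and check the goodness of fit before trusting the estimate, since the abstract notes that the estimate is only meaningful when the gamma approximation fits the data adequately.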
