A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on DNA composition patterns (e.g., tetramer frequencies) that differ among genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because DNA composition patterns vary substantially within a single genome. We developed a novel approach (AbundanceBin) for metagenomic binning that exploits the different abundances of species living in the same environment. AbundanceBin applies the Lander-Waterman model to metagenomics and is based on the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimates of species abundances and genome sizes, two important parameters for characterizing a microbial community. We also show that AbundanceBin performs well when the sequences are very short (e.g., 75 bp) or contain sequencing errors. By combining AbundanceBin with a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.
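To make the idea of abundance-based binning concrete, the following is a minimal sketch, not the AbundanceBin implementation. It assumes that l-tuple counts from species of different abundance follow Poisson distributions with different rates (in the spirit of the Lander-Waterman model), fits those rates with a plain expectation-maximization loop, and assigns each read to the bin that best explains the counts of its l-tuples. The function names (`count_ltuples`, `fit_poisson_mixture`, `assign_read`), the initialization, and the use of EM are illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter

def count_ltuples(reads, l):
    """Count every length-l substring (l-tuple) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts

def poisson_logpmf(k, lam):
    """Log of the Poisson probability of observing count k at rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def fit_poisson_mixture(values, n_bins, n_iter=100):
    """Fit a Poisson mixture to l-tuple counts with EM (assumed model).

    Simplification: ignores that unobserved l-tuples (count 0) are missing.
    """
    lo, hi = min(values), max(values)
    # Hypothetical initialization: rates spread evenly over the count range.
    lams = [lo + (hi - lo) * (j + 0.5) / n_bins for j in range(n_bins)]
    weights = [1.0 / n_bins] * n_bins
    for _ in range(n_iter):
        # E-step: posterior probability of each bin for each count.
        resp = []
        for k in values:
            logs = [math.log(w) + poisson_logpmf(k, lam)
                    for w, lam in zip(weights, lams)]
            top = max(logs)
            probs = [math.exp(x - top) for x in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: update mixing weights and Poisson rates.
        for j in range(n_bins):
            mass = sum(r[j] for r in resp)
            weighted = sum(r[j] * k for r, k in zip(resp, values))
            weights[j] = max(mass / len(values), 1e-12)
            lams[j] = max(weighted / max(mass, 1e-12), 1e-6)
    return weights, lams

def assign_read(read, l, counts, weights, lams):
    """Return the index of the bin with the highest log score for a read."""
    scores = [math.log(w) for w in weights]
    for i in range(len(read) - l + 1):
        k = counts[read[i:i + l]]
        for j, lam in enumerate(lams):
            scores[j] += poisson_logpmf(k, lam)
    return max(range(len(lams)), key=scores.__getitem__)
```

In use, one would call `count_ltuples` on all reads, pass `list(counts.values())` to `fit_poisson_mixture` with a chosen number of bins, and then call `assign_read` per read. The fitted rates play the role of relative abundance levels in this sketch; treating them as estimates of true coverage would require the corrections that a full method provides.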
