A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio

MOTIVATION With the rapid development of next-generation sequencing techniques, metagenomics, also known as environmental genomics, has emerged as an exciting research area that enables us to analyze the microbial environment in which we live. An important step for metagenomic data analysis is the identification and taxonomic characterization of DNA fragments (reads or contigs) resulting from sequencing a sample of mixed species. This step is referred to as 'binning'. Binning algorithms that are based on sequence similarity and sequence composition markers rely heavily on the reference genomes of known microorganisms or phylogenetic markers. Due to the limited availability of reference genomes and the bias and low availability of markers, these algorithms may not be applicable in all cases. Unsupervised binning algorithms which can handle fragments from unknown species provide an alternative approach. However, existing unsupervised binning algorithms only work on datasets either with balanced species abundance ratios or rather different abundance ratios, but not both. RESULTS In this article, we present MetaCluster 3.0, an integrated binning method based on the unsupervised top--down separation and bottom--up merging strategy, which can bin metagenomic fragments of species with very balanced abundance ratios (say 1:1) to very different abundance ratios (e.g. 1:24) with consistently higher accuracy than existing methods. AVAILABILITY MetaCluster 3.0 can be downloaded at http://i.cs.hku.hk/~alse/MetaCluster/.

[1]  B. Chor,et al.  Genomic DNA k-mer spectra: models and modalities , 2009, Genome Biology.

[2]  Siu-Ming Yiu,et al.  Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers , 2009, BMC Bioinformatics.

[3]  S Karlin,et al.  Compositional biases of bacterial genomes and evolutionary implications , 1997, Journal of bacteriology.

[4]  S Karlin,et al.  Comparisons of eukaryotic genomic sequences. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Siu-Ming Yiu,et al.  MetaCluster: unsupervised binning of environmental genomic fragments and taxonomic annotation , 2010, BCB '10.

[6]  James R. Cole,et al.  The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis , 2004, Nucleic Acids Res..

[7]  김삼묘,et al.  “Bioinformatics” 특집을 내면서 , 2000 .

[8]  References , 1971 .

[9]  S. Karlin,et al.  Dinucleotide relative abundance extremes: a genomic signature. , 1995, Trends in genetics : TIG.

[10]  Ying Xu,et al.  Barcodes for genomes and applications , 2008, BMC Bioinformatics.

[11]  S. Tringe,et al.  Comparative Metagenomics of Microbial Communities , 2004, Science.

[12]  A. Salamov,et al.  Use of simulated data sets to evaluate the fidelity of metagenomic processing methods , 2007, Nature Methods.

[13]  R. Amann,et al.  Combination of 16S rRNA-targeted oligonucleotide probes with flow cytometry for analyzing mixed microbial populations , 1990, Applied and environmental microbiology.

[14]  Zhaojun Bai,et al.  CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads , 2007, RECOMB.

[15]  Natalia Ivanova,et al.  Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities , 2006, Nature Biotechnology.

[16]  Frank Oliver Glöckner,et al.  TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences , 2004, BMC Bioinformatics.

[17]  R. Graham,et al.  Spearman's Footrule as a Measure of Disarray , 1977 .

[18]  Yu-Wei Wu,et al.  A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples , 2010, RECOMB.

[19]  Saman K. Halgamuge,et al.  BMC Bioinformatics BioMed Central Methodology article Binning sequences using very sparse labels within a metagenome , 2008 .

[20]  Yan Boucher,et al.  Use of 16S rRNA and rpoB Genes as Molecular Markers for Microbial Ecology Studies , 2006, Applied and Environmental Microbiology.

[21]  Alexander F. Auch,et al.  MEGAN analysis of metagenomic data. , 2007, Genome research.

[22]  Colin Hill,et al.  Functional and comparative metagenomic analysis of bile salt hydrolase activity in the human gut microbiome , 2008, Proceedings of the National Academy of Sciences.

[23]  R. Amann,et al.  Application of tetranucleotide frequencies for the assignment of genomic fragments. , 2004, Environmental microbiology.

[24]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[25]  P. Bork,et al.  A human gut microbial gene catalogue established by metagenomic sequencing , 2010, Nature.

[26]  Jonathan Dushoff,et al.  Unsupervised statistical clustering of environmental shotgun sequences , 2009, BMC Bioinformatics.

[27]  M. Kendall A NEW MEASURE OF RANK CORRELATION , 1938 .

[28]  Rustam I. Aminov,et al.  Predominant Role of Host Genetics in Controlling the Composition of Gut Microbiota , 2008, PloS one.

[29]  J. Banfield,et al.  Community structure and metabolism through reconstruction of microbial genomes from the environment , 2004, Nature.

[30]  O. White,et al.  Environmental Genome Shotgun Sequencing of the Sargasso Sea , 2004, Science.