MetaCluster 4.0: A Novel Binning Algorithm for NGS Reads and Huge Number of Species

Next-generation sequencing (NGS) technologies allow the sequencing of microbial communities directly from the environment without prior culturing. The output of environmental DNA sequencing consists of many reads from genomes of different unknown species, making the clustering together reads from the same (or similar) species (also known as binning) a crucial step. The difficulties of the binning problem are due to the following four factors: (1) the lack of reference genomes; (2) uneven abundance ratio of species; (3) short NGS reads; and (4) a large number of species (can be more than a hundred). None of the existing binning tools can handle all four factors. No tools, including both AbundanceBin and MetaCluster 3.0, have demonstrated reasonable performance on a sample with more than 20 species. In this article, we introduce MetaCluster 4.0, an unsupervised binning algorithm that can accurately (with about 80% precision and sensitivity in all cases and at least 90% in some cases) and efficiently bin short reads with varying abundance ratios and is able to handle datasets with 100 species. The novelty of MetaCluster 4.0 stems from solving a few important problems: how to divide reads into groups by a probabilistic approach, how to estimate the 4-mer distribution of each group, how to estimate the number of species, and how to modify MetaCluster 3.0 to handle a large number of species. We show that Meta Cluster 4.0 is effective for both simulated and real datasets. Supplementary Material is available at www.liebertonline.com/cmb.

[1]  S. Salzberg,et al.  Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models , 2009, Nature Methods.

[2]  Siu-Ming Yiu,et al.  Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers , 2009, BMC Bioinformatics.

[3]  C. Deming,et al.  Topographical and Temporal Diversity of the Human Skin Microbiome , 2009, Science.

[4]  Steven Salzberg,et al.  Clustering metagenomic sequences with interpolated Markov models , 2010, BMC Bioinformatics.

[5]  R. Knight,et al.  Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. , 2009, Genome research.

[6]  J. Eisen,et al.  A simple, fast, and accurate method of phylogenomic inference , 2008, Genome Biology.

[7]  Siu-Ming Yiu,et al.  MetaCluster: unsupervised binning of environmental genomic fragments and taxonomic annotation , 2010, BCB '10.

[8]  Natalia Ivanova,et al.  Metagenomic analysis of phosphorus removing sludge communities , 2008 .

[9]  Natalia Ivanova,et al.  Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities , 2006, Nature Biotechnology.

[10]  R. Amann,et al.  Combination of 16S rRNA-targeted oligonucleotide probes with flow cytometry for analyzing mixed microbial populations , 1990, Applied and environmental microbiology.

[11]  Yu-Wei Wu,et al.  A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples , 2010, RECOMB.

[12]  Yan Boucher,et al.  Use of 16S rRNA and rpoB Genes as Molecular Markers for Microbial Ecology Studies , 2006, Applied and Environmental Microbiology.

[13]  R. Amann,et al.  Application of tetranucleotide frequencies for the assignment of genomic fragments. , 2004, Environmental microbiology.

[14]  Frank Oliver Glöckner,et al.  TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences , 2004, BMC Bioinformatics.

[15]  P. Bork,et al.  A human gut microbial gene catalogue established by metagenomic sequencing , 2010, Nature.

[16]  James R. Cole,et al.  The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis , 2004, Nucleic Acids Res..

[17]  Mihai Pop,et al.  Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information , 2009, J. Comput. Biol..

[18]  I. Rigoutsos,et al.  Accurate phylogenetic classification of variable-length DNA fragments , 2007, Nature Methods.

[19]  Chris T. A. Evelo,et al.  The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services , 2010, BMC Bioinformatics.

[20]  Zhaojun Bai,et al.  CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads , 2007, RECOMB.

[21]  A. Halpern,et al.  The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific , 2007, PLoS biology.

[22]  Siu-Ming Yiu,et al.  A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio , 2011, Bioinform..

[23]  M. Snyder,et al.  RNA polymerase II stalling: loading at the start prepares genes for a sprint , 2008, Genome Biology.

[24]  R. Knight,et al.  Bacterial Community Variation in Human Body Habitats Across Space and Time , 2009, Science.

[25]  Jonathan A Eisen,et al.  Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes , 2007, PLoS biology.

[26]  L. Koski,et al.  The Closest BLAST Hit Is Often Not the Nearest Neighbor , 2001, Journal of Molecular Evolution.

[27]  Rustam I. Aminov,et al.  Predominant Role of Host Genetics in Controlling the Composition of Gut Microbiota , 2008, PloS one.

[28]  J. Banfield,et al.  Community structure and metabolism through reconstruction of microbial genomes from the environment , 2004, Nature.

[29]  Raj Acharya,et al.  A two-way multi-dimensional mixture model for clustering metagenomic sequences , 2011, BCB '11.