A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting

The rapid development of high-throughput technologies enables researchers to sequence the whole metagenome of a microbial community sampled directly from the environment. Assigning these sequence reads to species or taxonomic classes, referred to as binning, is a crucial step in metagenomic analysis. Most traditional binning methods rely on known reference genomes to assign reads accurately, and therefore cannot classify reads from unknown species that lack close references. To overcome this drawback, unsupervised learning approaches have been proposed, which do not require reference genomes of any known species. In this paper, we introduce a novel unsupervised method called MCluster for binning metagenomic sequences. The method uses N-grams to extract sequence features and applies automatic feature weighting to improve the performance of the basic K-means clustering algorithm. We evaluate MCluster on a variety of simulated data sets and one real data set, and compare it with three recent binning methods: AbundanceBin, MetaCluster 3.0, and MetaCluster 5.0. Experimental results show that MCluster achieves clearly better overall performance (F-measure) than AbundanceBin and MetaCluster 3.0 on long metagenomic reads (≥800 bp). Compared with MetaCluster 5.0 on short metagenomic reads ( bp), MCluster obtains higher sensitivity and a comparable yet more stable F-measure. This suggests that MCluster is a promising tool for binning metagenomic sequences effectively.
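
The two ingredients named in the abstract, N-gram features and feature-weighted K-means, can be illustrated with a minimal sketch. This is not the MCluster implementation: the abstract does not specify the weighting scheme, so the code assumes a generic rule in which features with lower within-cluster dispersion receive larger weights. The function names, the exponent `beta`, and the smoothing term are all hypothetical choices made for illustration.

```python
from itertools import product
import random


def ngram_features(seq, n=4):
    """Normalized frequency vector of all 4**n DNA n-grams in seq."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=n))}
    counts = [0.0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - n + 1):
        j = index.get(seq[i:i + n])  # n-grams with N or other symbols are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def weighted_kmeans(X, k, beta=2.0, iters=20, seed=0):
    """K-means with one weight per feature (assumed scheme, not MCluster's).

    Returns (labels, weights). beta must be > 1.
    """
    rng = random.Random(seed)
    d = len(X[0])
    centers = [list(x) for x in rng.sample(X, k)]
    weights = [1.0 / d] * d
    labels = [0] * len(X)
    for _ in range(iters):
        wb = [w ** beta for w in weights]
        # Assignment step: nearest center under the weighted squared distance.
        for i, x in enumerate(X):
            labels[i] = min(
                range(k),
                key=lambda c: sum(
                    wb[j] * (x[j] - centers[c][j]) ** 2 for j in range(d)
                ),
            )
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Weight step: lower within-cluster dispersion gives a higher weight.
        disp = [
            sum((X[i][j] - centers[labels[i]][j]) ** 2 for i in range(len(X)))
            for j in range(d)
        ]
        # Assumption: smooth with the mean dispersion so that constant
        # features (zero dispersion) do not dominate the weights.
        smooth = sum(disp) / d + 1e-12
        inv = [(1.0 / (D + smooth)) ** (1.0 / (beta - 1.0)) for D in disp]
        total = sum(inv)
        weights = [v / total for v in inv]
    return labels, weights
```

A possible usage, with made-up toy reads: build one feature vector per read with `ngram_features(read, n=2)`, then call `weighted_kmeans(vectors, k=2)` to obtain a bin label per read and the learned feature weights. Real reads would need a larger `n`, many more sequences, and a principled choice of `k`.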