SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads

A major challenge facing metagenomics is the development of tools for the characterization of functional and taxonomic content of vast amounts of short metagenome reads. In this paper, we present a two pass semi-supervised algorithm, SimComp, for soft clustering of short metagenome reads, that is a hybrid of comparative and composition based methods. In the first pass, a comparative analysis of the metagenome reads against BLASTx extracts the reference sequences from within the metagenome to form an initial set of seeded clusters. Those reads that have a significant match to the database are clustered by their phylogenetic provenance. In the second pass, the remaining fraction of reads are characterized by their species-specific composition based characteristics. SimComp groups the reads into overlapping clusters, each with its read leader. We make no assumptions about the taxonomic distribution of the dataset. The overlap between the clusters elegantly handles the challenges posed by the nature of the metagenomic data. The resulting cluster leaders can be used as an accurate estimate of the phylogenetic composition of the metagenomic dataset. Our method enriches the dataset into a small number of clusters, while accurately assigning fragments as small as 100 base pairs.

[1]  S. Salzberg,et al.  Bioinformatics challenges of new sequencing technology. , 2008, Trends in genetics : TIG.

[2]  Antoine Danchin,et al.  Codon Usage Domains over Bacterial Chromosomes , 2006, PLoS Comput. Biol..

[3]  Saman K. Halgamuge,et al.  BMC Bioinformatics BioMed Central Methodology article Binning sequences using very sparse labels within a metagenome , 2008 .

[4]  R. Amann,et al.  Application of tetranucleotide frequencies for the assignment of genomic fragments. , 2004, Environmental microbiology.

[5]  Lior Pachter,et al.  Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities , 2005, PLoS Comput. Biol..

[6]  Zhaojun Bai,et al.  CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads , 2007, RECOMB.

[7]  Elena Marchiori,et al.  Clustering Metagenome Short Reads Using Weighted Proteins , 2009, EvoBIO.

[8]  J. Parkhill,et al.  Comparative genomic structure of prokaryotes. , 2004, Annual review of genetics.

[9]  M. Narasimha Murty,et al.  An adaptive rough fuzzy single pass algorithm for clustering large data sets , 2003, Pattern Recognit..

[10]  S. Giovannoni,et al.  The uncultured microbial majority. , 2003, Annual review of microbiology.

[11]  S. Salzberg,et al.  Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models , 2009, Nature Methods.

[12]  I. Rigoutsos,et al.  Accurate phylogenetic classification of variable-length DNA fragments , 2007, Nature Methods.

[13]  Sean D. Hooper,et al.  Annotation of metagenome short reads using proxygenes , 2008, ECCB.

[14]  Frank Oliver Glöckner,et al.  TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences , 2004, BMC Bioinformatics.

[15]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[16]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[17]  Alexander F. Auch,et al.  MEGAN analysis of metagenomic data. , 2007, Genome research.

[18]  S Karlin,et al.  Heterogeneity of genomes: measures and values. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[19]  F.C. Harris,et al.  A fuzzy classifier to taxonomically group DNA fragments within a metagenome , 2008, NAFIPS 2008 - 2008 Annual Meeting of the North American Fuzzy Information Processing Society.

[20]  Jonathan Dushoff,et al.  Unsupervised statistical clustering of environmental shotgun sequences , 2009, BMC Bioinformatics.