TBC: A clustering algorithm based on prokaryotic taxonomy

High-throughput DNA sequencing technologies have revolutionized the study of microbial ecology. Massive sequencing of PCR amplicons of the 16S rRNA gene is widely used to characterize the microbial community structure of a variety of environmental samples. The resulting sequencing reads are clustered into operational taxonomic units (OTUs), which are then used to calculate statistical indices of the species diversity in a given sample. Several algorithms have been developed to perform this clustering, but they tend to produce different outcomes. Herein, we propose a novel sequence clustering algorithm, Taxonomy-Based Clustering (TBC). The algorithm adopts a basic concept of prokaryotic taxonomy, in which a species is formed solely through comparisons to its type strain, and it omits full-scale multiple sequence alignment. The clustering quality of the proposed method was compared with that of MOTHUR, BLASTClust, ESPRIT-Tree, CD-HIT, and UCLUST. A comprehensive comparison using three different experimental datasets produced by pyrosequencing demonstrated that the clustering obtained using TBC is comparable to that obtained using MOTHUR and ESPRIT-Tree, and that TBC is computationally efficient. The program was written in Java and is available from http://sw.ezbiocloud.net/tbc.
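The type-strain idea can be illustrated with a minimal sketch. This is not the published TBC implementation, whose details are not given in the abstract; it is a generic greedy scheme under stated assumptions. Each read is compared only to existing cluster representatives (standing in for "type strains"), never to every other read, and no multiple sequence alignment is built. The function names, the abundance-first ordering, the edit-distance-based similarity, and the 0.97 default cutoff are all illustrative assumptions.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (two rows kept in memory)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed pairwise similarity: 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster_by_representative(reads, threshold=0.97):
    """Greedy clustering against representatives only.

    Unique reads are visited from most to least abundant (an assumption).
    A read joins the first cluster whose representative it matches at or
    above `threshold`; otherwise it founds a new cluster and becomes that
    cluster's representative.
    Returns a list of (representative, members) tuples.
    """
    counts = Counter(reads)
    clusters = []
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep, members in clusters:
            if similarity(seq, rep) >= threshold:
                members.extend([seq] * n)
                break
        else:
            # No representative was close enough: start a new cluster.
            clusters.append((seq, [seq] * n))
    return clusters
```

Calling `cluster_by_representative(reads, threshold=0.97)` on a list of read strings returns one `(representative, members)` pair per OTU. Because each read is compared only to the representatives, the number of pairwise comparisons scales with reads times clusters rather than with the square of the number of reads, which is the kind of saving the abstract alludes to.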

[1]  A. Chao Nonparametric estimation of the number of classes in a population , 1984 .

[2]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[3]  A. Chao,et al.  Estimating the Number of Classes via Sample Coverage , 1992 .

[4]  Trinad Chakraborty,et al.  GECO-linear visualization for comparative genomics , 2007, Bioinform..

[5]  Hugh E. Williams,et al.  Clustered Sequence Representation for Fast Homology Search , 2007, J. Comput. Biol..

[6]  A. Chao,et al.  Stopping rules and estimation for recapture debugging with unequal failure rates , 1993 .

[7]  J. Chun,et al.  The analysis of oral microbial communities of wild-type and toll-like receptor 2-deficient mice using a 454 GS FLX Titanium pyrosequencer , 2010, BMC Microbiology.

[8]  R. Knight,et al.  Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. , 2009, Genome research.

[9]  Lawrence G. Wayne,et al.  International Committee on Systematic Bacteriology: Announcement of the Report of the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics , 1988 .

[10]  Adam Godzik,et al.  Clustering of highly homologous sequences to reduce the size of large protein databases , 2001, Bioinform..

[11]  Eugene W. Myers,et al.  Optimal alignments in linear space , 1988, Comput. Appl. Biosci..

[12]  A. Godzik,et al.  Sequence clustering strategies improve remote homology recognitions while reducing search times. , 2002, Protein engineering.

[13]  D. Bacon,et al.  Multiple Sequence Alignment , 1986, Journal of molecular biology.

[14]  Yunpeng Cai,et al.  ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time , 2011, Nucleic acids research.

[15]  A. Godzik,et al.  Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets , 2008, PloS one.

[16]  Fang Liu,et al.  Molecular analysis of the diversity of vaginal microbiota associated with bacterial vaginosis , 2010, BMC Genomics.

[17]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[18]  J. Retief,et al.  Phylogenetic analysis using PHYLIP. , 2000, Methods in molecular biology.

[19]  Ruth Ann Luna,et al.  Metagenomic pyrosequencing and microbial identification. , 2009, Clinical chemistry.

[20]  F Yang,et al.  Using affinity propagation combined post-processing to cluster protein sequences. , 2010, Protein and peptide letters.

[21]  Martin Hartmann,et al.  Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities , 2009, Applied and Environmental Microbiology.

[22]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[23]  Robert C. Edgar,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2001 .

[24]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[25]  S. Hurlbert The Nonconcept of Species Diversity: A Critique and Alternative Parameters. , 1971, Ecology.

[26]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.