DNACLUST: accurate and efficient clustering of phylogenetic marker genes

BackgroundClustering is a fundamental operation in the analysis of biological sequence data. New DNA sequencing technologies have dramatically increased the rate at which we can generate data, resulting in datasets that cannot be efficiently analyzed by traditional clustering methods.This is particularly true in the context of taxonomic profiling of microbial communities through direct sequencing of phylogenetic markers (e.g. 16S rRNA) - the domain that motivated the work described in this paper. Many analysis approaches rely on an initial clustering step aimed at identifying sequences that belong to the same operational taxonomic unit (OTU). When defining OTUs (which have no universally accepted definition), scientists must balance a trade-off between computational efficiency and biological accuracy, as accurately estimating an environment's phylogenetic composition requires computationally-intensive analyses. We propose that efficient and mathematically well defined clustering methods can benefit existing taxonomic profiling approaches in two ways: (i) the resulting clusters can be substituted for OTUs in certain applications; and (ii) the clustering effectively reduces the size of the data-sets that need to be analyzed by complex phylogenetic pipelines (e.g., only one sequence per cluster needs to be provided to downstream analyses).ResultsTo address the challenges outlined above, we developed DNACLUST, a fast clustering tool specifically designed for clustering highly-similar DNA sequences.Given a set of sequences and a sequence similarity threshold, DNACLUST creates clusters whose radius is guaranteed not to exceed the specified threshold. Underlying DNACLUST is a greedy clustering strategy that owes its performance to novel sequence alignment and k- mer based filtering algorithms.DNACLUST can also produce multiple sequence alignments for every cluster, allowing users to manually inspect clustering results, and enabling more detailed analyses of the clustered data.ConclusionsWe compare DNACLUST to two popular clustering tools: CD-HIT and UCLUST. We show that DNACLUST is about an order of magnitude faster than CD-HIT and UCLUST (exact mode) and comparable in speed to UCLUST (approximate mode). The performance of DNACLUST improves as the similarity threshold is increased (tight clusters) making it well suited for rapidly removing duplicates and near-duplicates from a dataset, thereby reducing the size of the data being analyzed through more elaborate approaches.

[1]  Robert Sedgewick,et al.  Fast algorithms for sorting and searching strings , 1997, SODA '97.

[2]  Susan M. Huse,et al.  Ironing out the wrinkles in the rare biosphere through improved OTU clustering , 2010, Environmental microbiology.

[3]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[4]  B L Maidak,et al.  The RDP-II (Ribosomal Database Project) , 2001, Nucleic Acids Res..

[5]  Mihai Pop,et al.  Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information , 2009, J. Comput. Biol..

[6]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[7]  Mihai Pop,et al.  Inexact Local Alignment Search over Suffix Arrays , 2009, 2009 IEEE International Conference on Bioinformatics and Biomedicine.

[8]  J. Tiedje,et al.  Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy , 2007, Applied and Environmental Microbiology.

[9]  Adam Godzik,et al.  Clustering of highly homologous sequences to reduce the size of large protein databases , 2001, Bioinform..

[10]  Vijay V. Vazirani,et al.  Approximation Algorithms , 2001, Springer Berlin Heidelberg.

[11]  Mihai Pop,et al.  Alignment and clustering of phylogenetic markers - implications for microbial diversity studies , 2010, BMC Bioinformatics.

[12]  C. Quince,et al.  Accurate determination of microbial diversity from 454 pyrosequencing data , 2009, Nature Methods.

[13]  Martin Hartmann,et al.  Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities , 2009, Applied and Environmental Microbiology.

[14]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[15]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[16]  B. Roe,et al.  A core gut microbiome in obese and lean twins , 2008, Nature.

[17]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[18]  BMC Bioinformatics , 2005 .

[19]  Edward Fredkin,et al.  Trie memory , 1960, Commun. ACM.

[20]  Esko Ukkonen,et al.  Approximate String-Matching over Suffix Trees , 1993, CPM.

[21]  Robert C. Edgar,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2001 .

[22]  J. Handelsman,et al.  Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness , 2005, Applied and Environmental Microbiology.