Better Greedy Sequence Clustering with Fast Banded Alignment

Comparing a string to a large set of sequences is a key subroutine in greedy heuristics for clustering genomic data. Clustering 16S rRNA gene sequences into operational taxonomic units (OTUs) is a common method used in studying microbial communities. We present a new approach to greedy clustering using a trie-like data structure and Four Russians speedup. We evaluate the running time of our method in terms of the number of comparisons it makes during clustering and show in experimental results that the number of comparisons grows linearly with the size of the dataset as opposed to the quadratic running time of other methods. We compare the clusters output by our method to the popular greedy clustering tool UCLUST. We show that the clusters we generate can be both tighter and larger. 1998 ACM Subject Classification B.2.4 Algorithms

[1]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[2]  Mihai Pop,et al.  DNACLUST: accurate and efficient clustering of phylogenetic marker genes , 2011, BMC Bioinformatics.

[3]  William A. Walters,et al.  Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms , 2012, The ISME Journal.

[4]  Eugene W. Myers,et al.  AnO(ND) difference algorithm and its variations , 1986, Algorithmica.

[5]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[6]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[7]  Eugene W. Myers,et al.  A fast bit-vector algorithm for approximate string matching based on dynamic programming , 1998, JACM.

[8]  Mihai Pop,et al.  Alignment and clustering of phylogenetic markers - implications for microbial diversity studies , 2010, BMC Bioinformatics.

[9]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[10]  Mike Paterson,et al.  A Faster Algorithm Computing String Edit Distances , 1980, J. Comput. Syst. Sci..

[11]  Tao Jiang,et al.  On the Complexity of Multiple Sequence Alignment , 1994, J. Comput. Biol..

[12]  A. Uitterlinden,et al.  Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA , 1993, Applied and environmental microbiology.

[13]  Robert C. Edgar,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2001 .

[14]  Julie D Thompson,et al.  Multiple Sequence Alignment Using ClustalW and ClustalX , 2003, Current protocols in bioinformatics.

[15]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .