Fast Sequence Clustering Using A Suffix Array Algorithm

MOTIVATION: Efficient clustering is important for handling the large number of available EST sequences. Most contemporary methods are based on some kind of all-against-all comparison, which results in quadratic time complexity. A different approach is needed to keep up with the rapid growth of EST data.

RESULTS: A new, fast EST clustering algorithm is presented. Sub-quadratic time complexity is achieved by using an algorithm based on suffix arrays. A prototype implementation has been developed and run on a benchmark data set. The resulting clusterings are validated by comparing them to clusterings produced by other methods, and the results are quite promising.

AVAILABILITY: The source code for the prototype implementation is available under the GPL from http://www.ii.uib.no/~ketil/bio/.
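To make the suffix-array idea concrete, the following is a minimal sketch and not the prototype's actual algorithm. It assumes that two sequences belong to the same cluster when they share an exact substring of at least `k` characters, and it merges clusters transitively. The names `suffix_array`, `common_prefix_len`, `cluster`, the threshold `k`, and the separator `$` are all hypothetical choices made for illustration. The naive sort used here is not sub-quadratic; a linear-time or O(n log n) suffix array construction would be needed for real data, and a practical EST clusterer would also have to handle reverse complements and sequencing errors.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a demo, but far too slow for real EST collections.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(text, i, j, sep):
    # Length of the common prefix of the suffixes at i and j,
    # stopping at the separator so matches never span two sequences.
    n = 0
    while (i + n < len(text) and j + n < len(text)
           and text[i + n] == text[j + n] and text[i + n] != sep):
        n += 1
    return n


def cluster(seqs, k, sep="$"):
    # Assumption: sep does not occur in any input sequence.
    text = "".join(s + sep for s in seqs)
    # owner[p] is the index of the sequence that position p belongs to.
    owner = [idx for idx, s in enumerate(seqs) for _ in range(len(s) + 1)]

    # Union-find over sequence indices.
    parent = list(range(len(seqs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Suffixes sharing a prefix of length >= k form a contiguous run in the
    # suffix array, so checking adjacent entries is enough to link them all.
    sa = suffix_array(text)
    for a, b in zip(sa, sa[1:]):
        if common_prefix_len(text, a, b, sep) >= k:
            parent[find(owner[a])] = find(owner[b])

    groups = {}
    for idx in range(len(seqs)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


if __name__ == "__main__":
    reads = ["ACGTACGT", "TTACGTAC", "GGGGCCCC", "CCCCAAAA"]
    print(cluster(reads, k=4))
```

The point of the sketch is that no pair of sequences is compared directly: sorting the suffixes brings shared substrings next to each other, so a single pass over adjacent suffix array entries finds every link between sequences.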
