CLU: A new algorithm for EST clustering

Background
The continuous flow of EST data remains one of the richest sources of discoveries in modern biology. The first step in EST data mining is usually EST clustering: grouping the original fragments according to their annotation or their similarity to known genomic DNA or to each other. Clustered EST data, accumulated in databases such as UniGene, STACK and the TIGR Gene Indices, have proven crucial in research areas ranging from gene discovery to the regulation of gene expression.

Results
We have developed a new nucleotide sequence matching algorithm, and an implementation of it, for clustering EST sequences. The program is based on the original CLU match detection algorithm, which has improved performance over the widely used d2_cluster. The CLU algorithm automatically ignores low-complexity regions such as poly-tracts and short tandem repeats.

Conclusion
CLU represents a new generation of EST clustering algorithms with improved performance over current approaches. An early implementation can be applied in small and medium-size projects. The CLU program is open source and available free of charge. It can be downloaded from http://compbio.pbrc.edu/pti
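
To make concrete what "low-complexity regions like poly-tracts and short tandem repeats" refers to, the following is a minimal illustrative sketch in Python. It is not the CLU algorithm, whose match detection handles such regions internally in a way not described here; it is a separate, explicit masking step. The function name `mask_low_complexity` and the thresholds `min_run`, `max_unit` and `min_copies` are assumptions chosen only for illustration.

```python
import re


def mask_low_complexity(seq, min_run=8, max_unit=4, min_copies=4):
    """Replace poly-tracts and short tandem repeats with 'N'.

    Hypothetical helper for illustration only; thresholds are assumptions,
    not values taken from CLU.
    """
    seq = seq.upper()

    def to_n(match):
        # Keep sequence length unchanged so coordinates stay valid.
        return "N" * len(match.group(0))

    # Poly-tract: one base repeated at least `min_run` times (e.g. a poly-A tail).
    poly = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    # Short tandem repeat: a unit of 2..max_unit bases repeated
    # at least `min_copies` times in a row (e.g. CACACACA).
    strp = re.compile(r"([ACGT]{2,%d})\1{%d,}" % (max_unit, min_copies - 1))

    seq = poly.sub(to_n, seq)
    seq = strp.sub(to_n, seq)
    return seq


if __name__ == "__main__":
    print(mask_low_complexity("GATTACA" + "T" * 10 + "GCGC"))
    print(mask_low_complexity("GG" + "CA" * 5 + "TT"))
```

Masked positions can then be skipped when comparing two ESTs, so that a shared poly-A tail or dinucleotide repeat alone does not cause unrelated sequences to be grouped together.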
