d2_cluster: a validated method for clustering EST and full-length cDNA sequences.

Several efforts are under way to condense single-read expressed sequence tags (ESTs) and full-length transcript data on a large scale by means of clustering or assembly. One goal of these projects is the construction of gene indices, in which transcripts are partitioned into index classes (or clusters) such that two transcripts are put into the same index class if and only if they represent the same gene. Accurate gene indexing facilitates gene expression studies and allows inexpensive, early discovery of partial gene sequences through the assembly of ESTs derived from genes that have yet to be positionally cloned or obtained directly through genomic sequencing. We describe d2_cluster, an agglomerative algorithm for rapidly and accurately partitioning transcript databases into index classes by clustering sequences according to minimal linkage or "transitive closure" rules. We then evaluate the relative efficiency of d2_cluster with respect to other clustering tools. UniGene is chosen for comparison because of its high quality and wide acceptance. It is shown that although d2_cluster and UniGene produce results that are between 83% and 90% identical, the joining rate of d2_cluster is between 8% and 20% greater than that of UniGene. Finally, we present the first published rigorous evaluation of under- and over-clustering (in other words, of type I and type II errors) of a sequence clustering algorithm, although the existence of highly identical gene paralogs means that care must be taken in the interpretation of the type II error. Upper bounds for these d2_cluster error rates are estimated at 0.4% and 0.8%, respectively. In other words, the sensitivity and selectivity of d2_cluster are estimated to be greater than 99.6% and 99.2%, respectively.
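To make the "transitive closure" rule concrete, the following is a minimal sketch of minimal-linkage clustering using a union-find structure: if sequence A matches B and B matches C, all three land in one index class even when A and C do not match directly. It is an illustration under stated assumptions, not the d2_cluster implementation. In particular, `shares_word` is a hypothetical stand-in for the real pairwise test (it only checks for one shared word of length `k`, whereas d2_cluster relies on the d2 word-based comparison), and the function names, the word length, and the all-pairs loop are choices made for this example only.

```python
from itertools import combinations


def shares_word(a, b, k=6):
    """Hypothetical similarity test (NOT the d2 measure):
    True if sequences a and b share at least one word of length k."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in words_a for i in range(len(b) - k + 1))


def cluster(seqs, match=shares_word):
    """Partition seqs into clusters by transitive closure of `match`.
    Returns a sorted list of clusters, each a list of sequence indices."""
    parent = list(range(len(seqs)))  # each sequence starts in its own cluster

    def find(i):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Agglomerative step: any matching pair merges its two clusters.
    for i, j in combinations(range(len(seqs)), 2):
        if match(seqs[i], seqs[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


seqs = [
    "AAAACCCCGG",  # shares CCCCGG with sequence 1
    "CCCCGGTTTT",  # shares GGTTTT with sequence 2
    "GGTTTTAAGC",  # no word in common with sequence 0
    "TGCATGCATG",  # unrelated to the others
]
print(cluster(seqs))  # [[0, 1, 2], [3]]
```

Sequences 0 and 2 share no word, yet they fall into the same cluster through sequence 1. This is the behavior that gives minimal linkage a high joining rate, and it is also why a single spurious match can merge two otherwise unrelated clusters (over-clustering).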
