Post-processing of BLAST results using databases of clustered sequences

MOTIVATION When evaluating the results of a sequence similarity search, there are many situations where it can be useful to determine whether sequences appearing in the results share some distinguishing characteristic. Such dependencies between database entries are often not readily identifiable, but can yield important new insights into the biological function of a gene or protein. RESULTS We have developed a program called CBLAST that sorts the results of a BLAST sequence similarity search according to sequence membership in user-defined 'clusters' of sequences. To demonstrate the utility of this application, we have constructed two cluster databases. The first describes clusters of nucleotide sequences representing the same gene, as documented in the UNIGENE database, and the second describes clusters of protein sequences which are members of the protein families documented in the PROSITE database. Cluster databases and the CBLAST post-processor provide an efficient mechanism for identifying and exploring relationships and dependencies between new sequences and database entries.

[1]  Gregory D. Schuler,et al.  ESTablishing a human transcript map , 1995, Nature Genetics.

[2]  E. Sonnhammer,et al.  Modular arrangement of proteins as inferred from analysis of homology , 1994, Protein science : a publication of the Protein Society.

[3]  Mark Dubnick,et al.  Btab - a Blast output parser , 1992, Comput. Appl. Biosci..

[4]  M. Boguski,et al.  dbEST — database for “expressed sequence tags” , 1993, Nature Genetics.

[5]  John C. Wootton,et al.  Statistics of Local Complexity in Amino Acid Sequences and Sequence Databases , 1993, Comput. Chem..

[6]  Jean-Michel Claverie,et al.  Information Enhancement Methods for Large Scale Sequence Analysis , 1993, Comput. Chem..

[7]  S. Altschul,et al.  Issues in searching molecular sequence databases , 1994, Nature Genetics.

[8]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[9]  Erik L. L. Sonnhammer,et al.  A workbench for large-scale sequence homology analysis , 1994, Comput. Appl. Biosci..

[10]  L. F. Kolakowski GCRDb: a G-protein-coupled receptor database. , 1994, Receptors & channels.

[11]  Amos Bairoch,et al.  The PROSITE database, its status in 1997 , 1997, Nucleic Acids Res..

[12]  J. Wootton,et al.  Construction of validated, non-redundant composite protein sequence databases. , 1990, Protein engineering.

[13]  C. Auffray,et al.  The I.M.A.G.E. Consortium: an integrated molecular analysis of genomes and their expression. , 1996, Genomics.

[14]  R. Harper World Wide Web resources for the biologist. , 1995, Trends in genetics : TIG.

[15]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence data bank and its new supplement TREMBL , 1996, Nucleic Acids Res..

[16]  Kolakowski Lf GCRDB: A G-PROTEIN-COUPLED RECEPTOR DATABASE , 1994 .

[17]  R. F. Smith,et al.  BEAUTY: an enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity search results. , 1995, Genome research.

[18]  A. Kerlavage,et al.  Complementary DNA sequencing: expressed sequence tags and human genome project , 1991, Science.

[19]  Amos Bairoch,et al.  The PROSITE database, its status in 1995 , 1996, Nucleic Acids Res..

[20]  Eugene V. Koonin,et al.  A simple tool to search for sequence motifs that are conserved in BLAST outputs , 1994, Comput. Appl. Biosci..

[21]  Sándor Pongor,et al.  The SBASE protein domain library, Release 4.0: a collection of annotated protein sequence segments , 1993, Nucleic Acids Res..

[22]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.