Paper information - A set-theoretic approach to database searching and clustering

A set-theoretic approach to database searching and clustering

MOTIVATION: In this paper, we introduce an iterative method of database searching and apply it to design a clustering algorithm applicable to an entire protein database. The clustering procedure relies on the quality of the database searching routine and further improves its results through a set-theoretic analysis of a cluster system that is highly redundant yet efficient to generate.

RESULTS: Overall, we achieve unambiguous assignment of 80% of SWISS-PROT sequences to non-overlapping sequence clusters in an entirely automatic fashion. Our results are compared to an expert-generated clustering for validation. The database searching method is fast, and the clustering technique does not require a time-consuming all-against-all comparison. This allows fast clustering of large numbers of sequences.

AVAILABILITY: The resulting clustering for the PIR1 (Release 51) and SWISS-PROT (Release 34) databases is available over the Internet from http://www.dkfz-heidelberg.de/tbi/services/modest/browsesysters.pl.

CONTACT: a.krause@dkfz-heidelberg.de; m.vingron@dkfz-heidelberg.de

Martin Vingron | Antje Krause
