The predictive power of the CluSTr database

SUMMARY The CluSTr database employs a fully automatic single-linkage hierarchical clustering method based on a similarity matrix. To compute the matrix, all-against-all pairwise comparisons between protein sequences are first performed using the Smith-Waterman algorithm. The statistical significance of the similarity scores is then assessed using a Monte Carlo analysis, yielding Z-values, which are used to populate the matrix. This paper describes automated annotation experiments that quantify the predictive power, and hence the biological relevance, of the CluSTr data. The experiments used the UniProt data-mining framework to derive annotation predictions from combinations of InterPro and CluSTr data. We show that combining these data sources greatly increases the precision of the predictions made by the data-mining framework, compared with the use of InterPro data alone. We conclude that the CluSTr approach to clustering proteins makes a valuable contribution to traditional protein classifications.

AVAILABILITY http://www.ebi.ac.uk/clustr/
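The three steps named in the summary (Smith-Waterman scoring, Monte Carlo Z-values, single-linkage clustering) can be illustrated with a minimal sketch. This is not the CluSTr implementation: the match/mismatch/gap scores, the shuffle count, the choice to shuffle only the second sequence, and the function names `sw_score`, `z_value` and `single_linkage` are all assumptions made for illustration. A production system would use a substitution matrix and affine gap penalties rather than the toy scoring used here.

```python
import random
from statistics import mean, pstdev


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not CluSTr's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def z_value(a, b, shuffles=100, rng=None):
    """Monte Carlo Z-value: how far the observed score lies above the
    mean score against shuffled copies of b, in standard deviations."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    observed = sw_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        null_scores.append(sw_score(a, "".join(letters)))
    sd = pstdev(null_scores)
    return (observed - mean(null_scores)) / sd if sd > 0 else 0.0


def single_linkage(z, threshold):
    """Group indices joined by any chain of pairs with Z >= threshold."""
    parent = list(range(len(z)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            if z[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(z)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

To use the sketch, one would fill a symmetric matrix with `z_value(seqs[i], seqs[j])` for every pair and pass it to `single_linkage`. Because raising the threshold can only remove links, the clusters obtained at a higher threshold are always nested inside those obtained at a lower one, which is what makes the single-linkage result hierarchical.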
