Predicting Gene Ontology functions from ProDom and CDD protein domains.

We describe a heuristic algorithm for associating molecular functions defined by the Gene Ontology (GO) with protein domains listed in the ProDom and CDD databases. The algorithm generates rules for function-domain associations from the intersection of the functions that the GO consortium has assigned to gene products containing ProDom and/or CDD domains, at varying levels of sequence similarity. The hierarchical nature of GO molecular functions is incorporated into rule generation. Manual review of a subset of the generated rules indicates an accuracy of 87% for ProDom rules and 84% for CDD rules. These associations allow a novel sequence to be assigned a putative function when it is sufficiently similar to a ProDom or CDD domain with which one or more GO functions have been associated. Although functional assignments are increasingly being made for gene products from model organisms, the needs of investigators are likely to continue to outpace the efforts of curators, particularly for nonmodel organisms. A comparison with other methods in terms of coverage and agreement indicates the utility of the approach. The domain-function associations and function assignments are available from our website, http://www.cbil.upenn.edu/GO.
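
The intersection-based rule generation summarized above can be illustrated with a small sketch. This is not the published implementation: the toy GO fragment `GO_PARENTS`, the helper names, and the choice to keep only the most specific shared terms are all assumptions made for illustration, and sequence-similarity thresholds are not modeled. The idea shown is that each protein's annotations are expanded with their GO ancestors before intersecting, so that proteins annotated with different child terms can still agree on a common parent function.

```python
# Hypothetical toy fragment of the GO molecular function hierarchy:
# each term maps to the set of its direct parents (assumed, not real GO data).
GO_PARENTS = {
    "protein kinase activity": {"kinase activity"},
    "lipid kinase activity": {"kinase activity"},
    "kinase activity": {"transferase activity"},
    "transferase activity": {"molecular_function"},
    "DNA binding": {"molecular_function"},
    "molecular_function": set(),
}


def with_ancestors(term):
    """Return the term together with all of its ancestors in GO_PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in GO_PARENTS.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand(terms):
    """Expand a set of annotated terms with every ancestor term."""
    expanded = set()
    for term in terms:
        expanded |= with_ancestors(term)
    return expanded


def domain_rule(protein_annotations):
    """Derive candidate functions for one domain (illustrative heuristic).

    protein_annotations: list of sets of GO terms, one set per annotated
    protein that contains the domain.
    """
    if not protein_annotations:
        return set()
    # Functions shared by every protein, once the hierarchy is taken into account.
    shared = set.intersection(*(expand(a) for a in protein_annotations))
    shared.discard("molecular_function")  # the root term carries no information
    # Keep only the most specific shared terms: drop any term that is a
    # proper ancestor of another shared term.
    return {
        t for t in shared
        if not any(t in with_ancestors(o) - {o} for o in shared)
    }


kinase_rule = domain_rule([{"protein kinase activity"}, {"lipid kinase activity"}])
no_rule = domain_rule([{"protein kinase activity"}, {"DNA binding"}])
```

In this toy case the two kinase-annotated proteins share no exact term, yet the expanded intersection still yields their common parent, while proteins with unrelated functions produce no rule for the domain.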
