Automated discovery of 3D motifs for protein function annotation

MOTIVATION: Function inference from structure is facilitated by the use of patterns of residues (3D motifs), normally identified by expert knowledge, that correlate with function. As an alternative to often limited expert knowledge, we use machine-learning techniques to identify patterns of 3-10 residues that maximize function prediction. This approach allows us to test the assumption that residues that provide function are the most informative for predicting function.

RESULTS: We apply our method, GASPS, to the haloacid dehalogenase, enolase, amidohydrolase and crotonase superfamilies and to the serine proteases. The motifs found by GASPS are as good at function prediction as 3D motifs based on expert knowledge. The GASPS motifs with the greatest ability to predict protein function consist mainly of known functional residues. However, several residues with no known functional role are equally predictive. For four groups, we show that the predictive power of our 3D motifs is comparable with or better than that of approaches that use the entire fold (Combinatorial-Extension) or sequence profiles (PSI-BLAST).

AVAILABILITY: Source code is freely available for academic use by contacting the authors.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
