Structure‐based identification of catalytic residues

Identifying catalytic residues is an essential step in the functional characterization of enzymes. We present a purely structural approach to this problem, motivated by the difficulty that evolution‐based methods have in annotating structural genomics targets with few or no homologs in the databases. Our approach combines a state‐of‐the‐art support vector machine (SVM) classifier with novel structural features that augment structural clues by spatial averaging and Z scoring. Special attention is paid to the class imbalance problem, which stems from the overwhelming number of non‐catalytic residues in enzymes compared with catalytic residues. We tackle this problem by: (1) optimizing the classifier to maximize a performance criterion that considers both Type I and Type II errors in the classification of catalytic and non‐catalytic residues; (2) under‐sampling non‐catalytic residues before SVM training; and (3) during SVM training, penalizing errors in learning catalytic residues more heavily than errors in learning non‐catalytic residues. Tested on four enzyme datasets, one that we designed specifically to mimic the structural genomics scenario and three previously evaluated datasets, our structure‐based classifier is never inferior to similar structure‐based classifiers and is comparable to classifiers that use both structural and evolutionary features. In addition to evaluating the performance of catalytic residue identification, we present detailed case studies of three proteins. This analysis suggests that many false positive predictions may correspond to binding sites and other functional residues. A web server that implements the method, the database we designed, and the source code of the programs are publicly available at http://www.cs.bgu.ac.il/~meshi/functionPrediction. Proteins 2011; © 2011 Wiley‐Liss, Inc.
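The feature augmentation and imbalance handling summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation (their programs are available from the URL above). The function names, the distance cutoff, the negative-to-positive ratio, and the choice of the Matthews correlation coefficient as a criterion that accounts for both Type I and Type II errors are all assumptions made for illustration.

```python
import math
import random


def spatial_average(coords, values, radius):
    """Average a per-residue feature over all residues within `radius`.

    The residue itself is included in its own neighborhood (an assumption).
    """
    averaged = []
    for center in coords:
        near = [v for c, v in zip(coords, values) if math.dist(center, c) <= radius]
        averaged.append(sum(near) / len(near))
    return averaged


def z_scores(values):
    """Standardize a feature across the residues of a single protein."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def under_sample(labels, ratio, seed=0):
    """Keep every catalytic residue (label 1) and a random subset of
    non-catalytic residues (label 0), at most `ratio` per catalytic residue.
    Returns the sorted indices of the retained residues."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(neg), int(ratio * len(pos)))
    return sorted(pos + rng.sample(neg, keep))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; penalizes both false positives
    (Type I errors) and false negatives (Type II errors)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In such a pipeline, each raw structural feature would be spatially averaged and Z-scored per protein, the training set would be reduced with `under_sample`, and candidate models would be compared by `mcc` on held-out residues. The third measure, penalizing errors on catalytic residues more heavily, is typically expressed as a larger per-class cost in the SVM itself (for example, LIBSVM's per-class weight option) and is not shown here.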
