Combining Structure and Sequence Information Allows Automated Prediction of Substrate Specificities within Enzyme Families

Functional annotation of enzymes must capture not only the type of reaction catalysed but also the substrate specificity, which can vary widely within the same family. In many cases, family membership and even substrate specificity can be predicted from the enzyme sequence alone, using a nearest neighbour classification rule. However, combining structural and sequence information can improve both the interpretability and the accuracy of predictive models. The method presented here, Active Site Classification (ASC), automatically extracts the residues lining the active site from one representative three-dimensional structure and the corresponding residues from the sequences of other members of the family. From a set of representatives with known substrate specificity, a Support Vector Machine (SVM) learns a model of substrate specificity. Applied to a sequence of unknown specificity, the SVM then predicts the most likely substrate. The models can also be analysed to reveal the structural determinants of substrate specificity and thus yield valuable insights into mechanisms of enzyme specificity. We illustrate the high prediction accuracy achieved on two benchmark data sets, and the structural insights gained from ASC, by a detailed analysis of the family of decarboxylating dehydrogenases. The ASC web service is available at http://asc.informatik.uni-tuebingen.de/.
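The core idea, reducing each family member to the residues aligned with the active site of one reference structure and classifying on that reduced representation, can be illustrated with a minimal sketch. This is not the ASC implementation. The sequences, the active-site residue numbers, the substrate labels, and all function names below are invented for illustration. To keep the example free of dependencies, a nearest neighbour rule on the active-site signature stands in for the SVM described above.

```python
from typing import List, Tuple


def active_site_columns(ref_aligned: str, site_residues: List[int]) -> List[int]:
    """Map 1-based residue numbers of the reference to alignment columns."""
    wanted = set(site_residues)
    columns = []
    residue_number = 0
    for col, char in enumerate(ref_aligned):
        if char == "-":          # gaps do not advance the residue counter
            continue
        residue_number += 1
        if residue_number in wanted:
            columns.append(col)
    return columns


def signature(aligned_seq: str, columns: List[int]) -> str:
    """Residues of one aligned sequence at the active-site columns."""
    return "".join(aligned_seq[c] for c in columns)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def predict(query_sig: str, training: List[Tuple[str, str]]) -> str:
    """Label of the training signature closest to the query (1-NN stand-in)."""
    best_sig, best_label = min(training, key=lambda item: hamming(query_sig, item[0]))
    return best_label


# Toy alignment (hypothetical sequences, all of equal aligned length).
reference = "MK-TLRAG-DE"            # aligned sequence of the reference structure
site_residues = [3, 5, 8]            # assumed active-site residue numbers
labelled = [
    ("MK-TLRAG-DE", "substrate_A"),
    ("MKQSIRSG-DE", "substrate_A"),
    ("MR-ELKAGADQ", "substrate_B"),
    ("MR-ELKSG-NQ", "substrate_B"),
]
query = "MKHELKAG-DE"                # sequence of unknown specificity

columns = active_site_columns(reference, site_residues)
training = [(signature(seq, columns), label) for seq, label in labelled]
query_signature = signature(query, columns)
prediction = predict(query_signature, training)
print(columns, query_signature, prediction)
```

In a setting closer to the one described, each signature would be encoded numerically (for example, by amino acid property values) and passed to an SVM; that encoding is a design choice this sketch leaves open.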
