Predicting and Annotating Catalytic Residues: An Information Theoretic Approach

We introduce a computational method that predicts and annotates the catalytic residues of a protein using only its sequence information: it identifies both the residues' sequence locations (prediction) and their specific biochemical roles in the catalyzed reaction (annotation). Although knowing the chemistry of an enzyme's catalytic residues is essential to understanding its function, prediction and annotation have remained difficult, especially when only the enzyme's sequence and no homologous structures are available. Our sequence-based approach follows the guiding principle that catalytic residues performing the same biochemical function should have similar chemical environments. It detects specific conservation patterns in the sequence neighborhood of known catalytic residues and accordingly constrains which combinations of amino acids can be present near a predicted catalytic residue. We associate with each catalytic residue a short sequence profile and define a Kullback-Leibler (KL) distance measure between these profiles, which, as we show, effectively captures even subtle biochemical variations. We apply the method to the class of glycohydrolase enzymes, which includes proteins from 96 families with very different sequences and folds, many of which perform important functions. In a cross-validation test, our approach correctly predicts the location of the enzymes' catalytic residues with a sensitivity of 80% at a specificity of 99.4%, and in a separate cross-validation it correctly annotates the biochemical role of 80% of the catalytic residues. Our results compare favorably to existing methods. Moreover, our method is more broadly applicable because it relies on sequence rather than structure information, and it may also be used in conjunction with structure-based methods.
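
The profile comparison described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the KL distance, so everything specific here is an assumption made for illustration: the per-column frequency profile, the pseudocount of 1, the symmetrized form of the KL divergence, the summation over window positions, and the toy sequence windows. The names build_profile, kl_divergence, and profile_distance are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(windows, pseudocount=1.0):
    """Per-position amino acid frequencies for equal-length sequence windows.

    Assumption: a pseudocount is added so that no frequency is zero,
    which keeps the KL divergence finite.
    """
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    profile = []
    for pos in range(length):
        # Count only the 20 standard amino acids at this position.
        counts = Counter(w[pos] for w in windows if w[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def kl_divergence(p, q):
    """KL divergence D(p || q) between two amino acid distributions."""
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def profile_distance(profile_a, profile_b):
    """Symmetrized KL divergence summed over positions (an assumed form)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    return sum(
        0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
        for p, q in zip(profile_a, profile_b)
    )


if __name__ == "__main__":
    # Made-up toy windows centered on a hypothetical catalytic residue.
    role_a = build_profile(["GDEWS", "GDEWT", "ADEWS"])
    role_b = build_profile(["LNDPK", "LNDPR", "INDPK"])
    print("distance(a, a):", profile_distance(role_a, role_a))
    print("distance(a, b):", profile_distance(role_a, role_b))
```

Under these assumptions, a profile has distance zero to itself, and profiles built from windows with different conservation patterns have a positive distance. A candidate residue could then be assigned the role of the nearest known profile, although the abstract does not state that the method works exactly this way.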
