Analysis and prediction of functional sub-types from protein sequence alignments.

The increasing number and diversity of protein sequence families require new methods to define and predict details of function. Here, we present a method for the analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and a set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparing sub-type specific sequence profiles and analysing positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method can also predict the sub-type of unclassified sequences. We assess several variations on a prediction method and compare them to simple sequence comparisons. For assessment, we remove close homologues of the sequence for which a prediction is to be made (those with a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family but is not a close relative of another protein of known sub-type. Considering the four families above and a sequence identity threshold of 30 %, our best method gives an accuracy of 96 %, compared to 80 % for sequence similarity and 74 % for BLAST. We describe a set of sub-type groupings derived from automated parsing of alignments from PFAM and the SWISSPROT database, and use it to perform a large-scale assessment. The best method gives an average accuracy of 94 %, compared to 68 % for sequence similarity and 79 % for BLAST. We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.
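
The idea of comparing sub-type specific profiles column by column can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform pseudocount, the symmetrised form of the relative entropy and the toy alignment are all assumptions made for illustration, and the significance test applied to the scores in the paper is omitted.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(residues, pseudocount=1.0):
    """Residue frequencies for one alignment column (gaps are ignored).

    The uniform pseudocount is an assumption; it keeps every probability
    non-zero so that the logarithm below is always defined.
    """
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler divergence) of p from q, in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)


def positional_relative_entropy(alignment, subtypes, target):
    """Score each column by how strongly `target` differs from the rest.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    subtypes:  dict mapping sequence name -> sub-type label
    Returns one symmetrised relative entropy per column (an assumed variant).
    """
    inside = [s for n, s in alignment.items() if subtypes[n] == target]
    outside = [s for n, s in alignment.items() if subtypes[n] != target]
    length = len(next(iter(alignment.values())))
    scores = []
    for i in range(length):
        p = column_profile([s[i] for s in inside])
        q = column_profile([s[i] for s in outside])
        scores.append(relative_entropy(p, q) + relative_entropy(q, p))
    return scores


# Hypothetical toy alignment: column 0 is conserved across both sub-types,
# column 1 separates them, column 2 is conserved again.
alignment = {"seq1": "ADG", "seq2": "ADG", "seq3": "AKG", "seq4": "AKG"}
subtypes = {"seq1": "X", "seq2": "X", "seq3": "Y", "seq4": "Y"}
scores = positional_relative_entropy(alignment, subtypes, "X")
best_column = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the highest-scoring column is the one where the two sub-types carry different residues. Columns that are identical in both groups score zero, because the two profiles coincide.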
