Paper information - Analysis and prediction of functional sub-types from protein sequence alignments.

The increasing number and diversity of protein sequence families require new methods to define and predict details regarding function. Here, we present a method for analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and a set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparing sub-type-specific sequence profiles and analysing positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method is also able to predict the sub-type of unclassified sequences. We assess several variations on a prediction method, and compare them to simple sequence comparisons. For assessment, we remove close homologues of the sequence for which a prediction is to be made (those with a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family, but is not a close relative of another protein of known sub-type. Considering the four families above and a sequence identity threshold of 30%, our best method gives an accuracy of 96%, compared to 80% obtained for sequence similarity and 74% for BLAST. We describe the derivation of a set of sub-type groupings from an automated parsing of alignments from PFAM and the SWISSPROT database, and use this set to perform a large-scale assessment. Here the best method gives an average accuracy of 94%, compared to 68% for sequence similarity and 79% for BLAST. We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.
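
The idea of scoring alignment positions by the relative entropy between sub-type profiles can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain per-column amino-acid frequencies with a flat pseudocount in place of whatever profile model the paper uses, it compares each sub-type against all remaining sequences, and the function names, toy alignment and sub-type labels are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(residues, pseudocount=1.0):
    """Smoothed amino-acid frequencies for one column (gaps are ignored).

    The flat pseudocount is an assumption made for this sketch.
    """
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler divergence) of p from q, in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)


def positional_relative_entropy(alignment, subtypes):
    """Score every alignment column.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    subtypes:  dict mapping sequence name -> sub-type label
    The score of a column is the sum, over sub-types, of the relative entropy
    between that sub-type's profile and the profile of all other sequences.
    """
    labels = sorted(set(subtypes.values()))
    length = len(next(iter(alignment.values())))
    scores = []
    for i in range(length):
        score = 0.0
        for label in labels:
            inside = [s[i] for n, s in alignment.items() if subtypes[n] == label]
            outside = [s[i] for n, s in alignment.items() if subtypes[n] != label]
            score += relative_entropy(column_profile(inside),
                                      column_profile(outside))
        scores.append(score)
    return scores


# Hypothetical toy alignment: only the third column separates the sub-types.
alignment = {
    "a1": "GKDA", "a2": "GKDS", "a3": "GRDA",
    "b1": "GKEA", "b2": "GREA", "b3": "GKES",
}
subtypes = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}

scores = positional_relative_entropy(alignment, subtypes)
for position, score in enumerate(scores, start=1):
    print(f"column {position}: {score:.3f}")
```

In this toy case the columns whose residue distributions are identical in both sub-types score zero, and the column that differs between them receives the only positive score. Deciding which scores are "significantly high" would need a background distribution or a statistical test, which this sketch does not attempt.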

S. Hannenhalli | R. Russell

[29] R A Sayle,et al. RASMOL: biomolecular graphics for all. , 1995, Trends in biochemical sciences.

[30] M. Swindells,et al. Intrinsic φ,ψ propensities of amino acids, derived from the coil regions of known structures , 1995, Nature Structural Biology.

[31] J. Holbrook,et al. Guided evolution of enzymes with new substrate specificities. , 1996, Journal of molecular biology.

[32] Rolf Apweiler,et al. The SWISS-PROT protein sequence data bank and its new supplement TREMBL , 1996, Nucleic Acids Res..

[33] F E Cohen,et al. Evolutionarily conserved Galphabetagamma binding surfaces support a model of the G protein-receptor complex. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[34] F. Cohen,et al. An evolutionary trace method defines binding surfaces common to protein families. , 1996, Journal of molecular biology.

[35] T J Gibson,et al. PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. , 1996, Nucleic acids research.

[36] A. Valencia,et al. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. , 1997, Folding & design.

[37] A. Valencia,et al. Correlated mutations contain information about protein-protein interaction. , 1997, Journal of molecular biology.

[38] P E Bourne,et al. The protein kinase resource. , 1997, Trends in biochemical sciences.

[39] F E Cohen,et al. Identification of functional surfaces of the zinc binding domains of intracellular receptors. , 1997, Journal of molecular biology.

[40] J. Hurley,et al. Structure of the adenylyl cyclase catalytic core , 1997, Nature.

[41] A. Valencia,et al. Shaping of Drosophila Alcohol Dehydrogenase Through Evolution: Relationship with Enzyme Functionality , 1998, Journal of Molecular Evolution.

[42] Durbin,et al. Biological Sequence Analysis , 1998 .

[43] Sean R. Eddy,et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[44] J B Hurley,et al. Two amino acid substitutions convert a guanylyl cyclase, RetGC-1, into an adenylyl cyclase. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[45] Temple F. Smith,et al. Comparison of the complete protein sets of worm and yeast: orthology and divergence. , 1998, Science.

[46] Sean R. Eddy,et al. Profile hidden Markov models , 1998, Bioinform..

[47] Kimmen Sjölander,et al. Phylogenetic Inference in Protein Superfamilies: Analysis of SH2 Domains , 1998, ISMB.

[48] Peer Bork,et al. SMART, a simple modular architecture research tool , 1998 .

[49] M J Sternberg,et al. Supersites within superfolds. Binding site similarity in the absence of homology. , 1998, Journal of molecular biology.

[50] Miguel A. Andrade-Navarro,et al. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families , 1998, Bioinform..

[51] S J Remington,et al. Glycerol kinase from Escherichia coli and an Ala65-->Thr mutant: the crystal structures reveal conformational changes with implications for allosteric regulation. , 1998, Structure.

[52] N. Grishin,et al. The Zn-peptidase superfamily: functional convergence after evolutionary divergence. , 1999, Journal of molecular biology.

[53] A. Fiser,et al. Convergent evolution of Trichomonas vaginalis lactate dehydrogenase from malate dehydrogenase. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[54] I R Vetter,et al. Effector Recognition by the Small GTP-binding Proteins Ras and Ral* , 1999, The Journal of Biological Chemistry.

[55] Robert D. Finn,et al. Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins , 1999, Nucleic Acids Res..

[56] A. Bairoch,et al. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999 , 1999, Nucleic Acids Res..

[57] D. Eisenberg,et al. A combined algorithm for genome-wide prediction of protein function , 1999, Nature.

[58] S. Brenner. Errors in genome annotation. , 1999, Trends in genetics : TIG.

[59] Miguel A. Andrade-Navarro. Position-Specific Annotation of Protein Function Based on Multiple Homologs , 1999, ISMB.

[60] D. Eisenberg,et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[61] A Valencia,et al. Model of the ran-RCC1 interaction using biochemical and docking experiments. , 1999, Journal of molecular biology.

[62] Anton J. Enright,et al. Protein interaction maps for complete genomes based on gene fusion events , 1999, Nature.

[63] M A Andrade,et al. Position-specific annotation of protein function based on multiple homologs. , 1999, Proceedings. International Conference on Intelligent Systems for Molecular Biology.

[64] O. Lichtarge,et al. A regulator of G protein signaling interaction surface linked to effector specificity. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[65] F. Cohen,et al. Co-evolution of proteins with their interaction partners. , 2000, Journal of molecular biology.