Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach.

The amino acid sequences of proteins provide rich information for inferring distant phylogenetic relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from amino acid sequences is also important because the rate matrix can be used to develop scoring matrices for sequence alignment. Here we use a continuous time Markov process to model the substitution rates of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate our method using simulated protein sequences. Because different local regions, such as binding surfaces and the protein interior core, experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces. In addition, residues on the binding surfaces have substitution rates that are very different from those of residues in the rest of the protein. Based on these findings, we further develop a method for protein function prediction by surface matching, using scoring matrices derived from the estimated substitution rates of residues located on the binding surfaces. We show with examples that our method is effective in identifying functionally related proteins that have low overall sequence identity, a task known to be very challenging.
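To make the link between a substitution rate matrix and a scoring matrix concrete, the following is a minimal sketch, not the authors' implementation. It assumes a time-reversible continuous time Markov model over 20 residues, computes the transition probabilities P(t) = exp(Qt), and converts them to log-odds scores. The exchangeabilities and equilibrium frequencies are random placeholders rather than estimated values, the Bayesian MCMC estimation step is not shown, and the function names, the divergence time `t = 0.5`, and the log-odds form with `scale = 2.0` are all illustrative assumptions.

```python
import numpy as np


def reversible_rate_matrix(exchange, pi):
    """Build Q with Q[i, j] = exchange[i, j] * pi[j] for i != j.

    `exchange` is a symmetric matrix of non-negative exchangeabilities
    and `pi` the equilibrium residue frequencies (both hypothetical here).
    """
    q = exchange * pi[np.newaxis, :]
    np.fill_diagonal(q, 0.0)
    # Each diagonal entry makes its row sum to zero.
    np.fill_diagonal(q, -q.sum(axis=1))
    # Rescale so the expected substitution rate at equilibrium is 1.
    q /= -np.dot(pi, np.diag(q))
    return q


def transition_matrix(q, pi, t):
    """Return P(t) = exp(Q t) for a reversible Q.

    Reversibility makes D^{1/2} Q D^{-1/2} symmetric (D = diag(pi)),
    so a symmetric eigendecomposition can be used.
    """
    root = np.sqrt(pi)
    sym = (root[:, None] * q) / root[None, :]
    sym = (sym + sym.T) / 2.0  # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(sym)
    exp_sym = (eigvecs * np.exp(eigvals * t)) @ eigvecs.T
    return (exp_sym / root[:, None]) * root[None, :]


def log_odds_scores(p, pi, scale=2.0):
    """Assumed score form: S[i, j] = scale * log2(P[i, j] / pi[j])."""
    return scale * np.log2(p / pi[np.newaxis, :])


# Placeholder inputs; a real application would use estimated rates.
rng = np.random.default_rng(0)
n = 20
raw = rng.random((n, n))
exchange = (raw + raw.T) / 2.0
pi = rng.dirichlet(np.full(n, 5.0))

q = reversible_rate_matrix(exchange, pi)
p = transition_matrix(q, pi, t=0.5)
scores = log_odds_scores(p, pi)
```

Under this sketch, region-specific scoring matrices would come from supplying a separate rate matrix for each region (for example buried core, exposed surface, or binding surface) and repeating the last two steps.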
