Ideal amino acid exchange forms for approximating substitution matrices

We analyzed 29 published substitution matrices (SMs) and, for comparison, five statistical protein contact potentials (CPs). We find that popular, ‘classical’ SMs obtained mainly from sequence alignments of globular proteins mostly correlate with one another at 0.9 or higher. BLOSUM62 is the central element of this group. A second group includes SMs derived from alignments of remote homologs or transmembrane proteins. These matrices correlate better with classical SMs (0.8) than among themselves (0.7). A third group consists of intermediate links between SMs and CPs: matrices and potentials that exhibit mutual correlations of at least 0.8. Next, we show that SMs can be approximated with a correlation of 0.9 by expressions $c_0 + x_i x_j + y_i y_j + z_i z_j$, $1 \le i, j \le 20$, where $c_0$ is a constant and the vectors $(x_i)$, $(y_i)$, $(z_i)$ correlate highly with hydrophobicity, molecular volume and coil preferences of amino acids, respectively. The present paper continues our earlier work (Pokarowski et al., Proteins 2005;59:49–57), where similar approximations were used to derive ideal amino acid interaction forms from CPs. Both approximations allow us to understand general trends in amino acid similarity and can help improve multiple sequence alignment using the fast Fourier transform (MAFFT), fast threading, or other methods based on alignments of physicochemical profiles of protein sequences. The use of this approximation in sequence alignments instead of a classical SM yields results that differ by less than 5%. Intermediate links between SMs and CPs, new formulas for approximating these matrices, and the highly significant dependence of classical SMs on coil preferences are new findings. Proteins 2007. © 2007 Wiley-Liss, Inc.
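The approximating form in the abstract, a constant plus three products of per-residue values, can be illustrated with a short sketch. The abstract does not say how the constant and the three vectors were fitted, so the code below makes its own assumptions: the constant defaults to the mean matrix entry, the vectors come from the three largest non-negative eigenvalues of the shifted matrix, and correlation is taken as the Pearson coefficient over the upper triangle. The function names `rank3_approximation` and `matrix_correlation` are hypothetical and used only for illustration.

```python
import numpy as np


def rank3_approximation(sm, c0=None):
    """Approximate a symmetric matrix by c0 + x x^T + y y^T + z z^T.

    Assumption (not from the paper): c0 defaults to the mean entry, and
    x, y, z are scaled eigenvectors of (sm - c0) for its three largest
    eigenvalues, with negative eigenvalues clipped to zero so that every
    term keeps the '+ v_i v_j' form.
    """
    sm = np.asarray(sm, dtype=float)
    if c0 is None:
        c0 = float(sm.mean())
    eigvals, eigvecs = np.linalg.eigh(sm - c0)      # eigenvalues ascending
    top = np.argsort(eigvals)[::-1][:3]             # indices of the 3 largest
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    vectors = eigvecs[:, top] * scale               # columns are x, y, z
    approx = c0 + vectors @ vectors.T
    return c0, vectors, approx


def matrix_correlation(a, b):
    """Pearson correlation of two symmetric matrices over the upper
    triangle, diagonal included (an assumed convention)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    iu = np.triu_indices_from(a)
    return float(np.corrcoef(a[iu], b[iu])[0, 1])
```

Given a 20 × 20 substitution matrix `sm` loaded as a NumPy array, `rank3_approximation(sm)` returns the constant, a 20 × 3 array whose columns play the roles of $(x_i)$, $(y_i)$, $(z_i)$, and the approximating matrix; `matrix_correlation(sm, approx)` then measures how closely the two agree. Whether the resulting columns track hydrophobicity, volume, or coil preference would have to be checked against amino acid property scales.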
