Grouping of amino acid types and extraction of amino acid properties from multiple sequence alignments using variance maximization

Understanding which amino acid types co‐occur in trusted multiple sequence alignments is a prerequisite for improving sequence alignment and remote homology detection algorithms. Two objective approaches were used to investigate co‐occurrence, both based on maximizing the variance of the weighted residue frequencies in columns taken from a large alignment database. The first approach partitioned the amino acid types into discrete groups, and the second extracted orthogonal properties of amino acids using principal component analysis. The groupings corresponded to physical properties of amino acids such as side chain hydrophobicity, size, or backbone flexibility, and an optimal arrangement of approximately eight groups was observed. Interpretation of the orthogonal properties was more complex. The principal components accounting for the largest variances correlated modestly with hydrophobicity and with conservation of glycine, but in general the principal components did not correspond to physical properties of amino acids. Although not intuitive, these mathematical properties of amino acids were shown to be robust and, in a simple test case, to improve local pairwise alignment accuracy relative to using the 20 amino acid frequencies alone. Proteins 2005. © 2005 Wiley‐Liss, Inc.
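
To make the second approach more concrete, the following is a minimal sketch of principal component analysis applied to per-column residue frequencies. It is an illustration under stated assumptions, not the authors' implementation: the toy alignment columns, the function names `column_frequencies` and `principal_components`, and the use of unweighted frequencies (the abstract describes weighted frequencies, whose weighting scheme is not given here) are all hypothetical choices.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types


def column_frequencies(columns):
    """Return an (n_columns, 20) array of residue frequencies per column.

    Assumption: frequencies are unweighted; a sequence-weighting scheme
    would replace the simple count of 1 per residue.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    freqs = np.zeros((len(columns), len(AMINO_ACIDS)))
    for row, column in enumerate(columns):
        for residue in column:
            if residue in index:  # skip gaps and unknown symbols
                freqs[row, index[residue]] += 1
        total = freqs[row].sum()
        if total > 0:
            freqs[row] /= total
    return freqs


def principal_components(freqs):
    """Eigen-decompose the covariance of the frequency vectors.

    Returns variances (eigenvalues, largest first) and the matching
    orthonormal directions as columns of a 20 x 20 matrix.
    """
    centered = freqs - freqs.mean(axis=0)
    cov = centered.T @ centered / (len(freqs) - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


# Hypothetical toy columns; a real analysis would draw many columns
# from a large alignment database.
columns = ["IIVLL", "LLLMI", "DDEED", "KRKKR", "GGGGG", "STSTA"]
freqs = column_frequencies(columns)
variances, components = principal_components(freqs)

# Loadings of the direction with the largest variance, one per amino acid.
first_component = dict(zip(AMINO_ACIDS, components[:, 0]))
```

Each column of `components` is a direction in 20-dimensional frequency space, and the directions are mutually orthogonal, which is the sense in which the extracted properties are "orthogonal". Projecting a column's frequency vector onto the leading directions gives a reduced description of that column.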
