Solving the protein sequence metric problem.

Biological sequences are composed of long strings of alphabetic letters rather than arrays of numerical values. The lack of a natural underlying metric for comparing such alphabetic data significantly inhibits sophisticated statistical analyses of sequences, modeling of structural and functional aspects of proteins, and related problems. Herein, we use multivariate statistical analyses on almost 500 amino acid attributes to produce a small set of highly interpretable numeric patterns of amino acid variability. These high-dimensional attribute data are summarized by five multidimensional patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. The numerical scores for each amino acid are then used to transform amino acid sequences into numeric data for statistical analysis. Comparisons of the transformed data with amino acid substitution matrices show significant associations for the polarity and codon diversity scores. The transformed alphabetic data are used in analysis of variance and discriminant analysis to study DNA binding in the basic helix-loop-helix proteins. The transformed scores offer a general solution for a wide variety of sequence analysis problems.
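The transformation step described above can be illustrated with a minimal Python sketch. It assumes a lookup table that gives each of the 20 amino acids one score on each of the five patterns. The function names (`placeholder_scores`, `encode_sequence`, `column_means`) are hypothetical, and the score values are randomly generated placeholders, not the published scores; a real analysis would substitute the actual table.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
FACTOR_NAMES = (
    "polarity",
    "secondary_structure",
    "molecular_volume",
    "codon_diversity",
    "electrostatic_charge",
)


def placeholder_scores(seed=0):
    """Return made-up scores for demonstration only.

    These are NOT the published values; they are seeded random numbers
    standing in for a real amino-acid-by-factor score table.
    """
    rng = random.Random(seed)
    return {
        aa: tuple(rng.uniform(-2.0, 2.0) for _ in FACTOR_NAMES)
        for aa in AMINO_ACIDS
    }


def encode_sequence(seq, scores):
    """Map each residue to its score vector, giving len(seq) rows of 5 values."""
    unknown = sorted(set(seq) - set(scores))
    if unknown:
        raise ValueError(f"no scores for residues: {unknown}")
    return [scores[aa] for aa in seq]


def column_means(rows):
    """Mean of each factor over the sequence, as a simple numeric summary."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


scores = placeholder_scores()
encoded = encode_sequence("MKRLA", scores)  # arbitrary example peptide
means = column_means(encoded)
```

Once a sequence or an alignment column is encoded this way, each position becomes a numeric vector, so standard procedures such as analysis of variance or discriminant analysis can be applied to it directly.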
