Sequence variations within protein families are linearly related to structural variations.

It is commonly believed that similarity between the sequences of two proteins implies similarity between their structures. Sequence alignments reliably recognize pairs of proteins with similar structures, provided that the percentage sequence identity between the two sequences is sufficiently high. This recognition, however, is statistically less reliable when the percentage sequence identity falls below 30%, and little is known in that regime about the detailed relationship between the two measures of similarity. Here, we investigate the inverse correlation between structural similarity and sequence similarity in 12 protein structure families. We define the structural similarity between two proteins as the cRMS (coordinate root-mean-square) distance between their structures. The sequence similarity for a pair of proteins is measured as the mean distance between the sequences in the subsets of sequence space compatible with their structures. We approximate the sequence space compatible with a protein by designing a collection of sequences that are both stable in and specific to the structure of that protein. Using these measures of sequence and structure similarity, we find that structural changes within a protein family are linearly related to changes in sequence similarity.
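
The sequence measure described above, a mean distance between two collections of designed sequences, can be sketched as follows. This is a minimal illustration only: the abstract does not state which per-pair distance is used, so the fractional Hamming distance here is an assumption, as are the function names `fractional_hamming` and `mean_set_distance` and the toy sequences. The sequences are also assumed to be pre-aligned to equal length.

```python
from itertools import product


def fractional_hamming(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ.

    Assumption: this per-pair distance is a stand-in; the abstract
    does not specify the distance actually used.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_set_distance(set_a, set_b) -> float:
    """Mean per-pair distance over all cross-set pairs (one sequence from each set)."""
    pairs = list(product(set_a, set_b))
    if not pairs:
        raise ValueError("both sequence sets must be non-empty")
    return sum(fractional_hamming(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: two small sets of "designed" sequences,
# one set per protein structure.
designed_for_a = ["ACDE", "ACDF"]
designed_for_b = ["ACDE", "GHIK"]

print(mean_set_distance(designed_for_a, designed_for_b))  # 0.5625
```

Averaging over every cross-set pair, rather than comparing one representative sequence per protein, is what makes the measure a property of the two sequence subsets rather than of two individual sequences.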
