Database of homology‐derived protein structures and the structural meaning of sequence alignment

The database of known protein three‐dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The database of known sequences, currently at more than 12,000 proteins, is two orders of magnitude larger than the database of known structures. (2) The currently most powerful method of predicting protein structures is model building by homology. (3) Structural homology can be inferred from the level of sequence similarity. (4) The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity, and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology‐derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modeled explicitly. The database effectively increases the number of known protein structures by a factor of five to more than 1800. The results may be useful in assessing the structural significance of matches in sequence database searches, in deriving preferences and patterns for structure prediction, in elucidating the structural role of conserved residues, and in modeling three‐dimensional detail by homology.
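The length-dependent test described above can be illustrated with a minimal sketch. It assumes a pairwise alignment given as two equal-length gapped strings, and it uses a placeholder curve, `toy_threshold`, that is purely hypothetical: its shape and numbers are not the threshold curve reported in the paper. The function names are likewise illustrative.

```python
from typing import Callable


def percent_identity(a: str, b: str) -> tuple[float, int]:
    """Return (percent identity, aligned length) for two gapped alignment rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    # Count only columns in which both sequences have a residue (no gap).
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0, 0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs), len(pairs)


def toy_threshold(length: int) -> float:
    """Hypothetical placeholder curve, NOT the published one.

    It only mimics the qualitative behaviour stated in the abstract:
    short alignments need higher similarity than long ones.
    """
    return max(25.0, 100.0 - length)


def passes_threshold(
    a: str, b: str, threshold: Callable[[int], float] = toy_threshold
) -> bool:
    """True if the alignment's identity is at or above the curve for its length."""
    identity, length = percent_identity(a, b)
    return length > 0 and identity >= threshold(length)


if __name__ == "__main__":
    short_a, short_b = "ACDEFGHIKL", "ACDEFWWWWW"
    long_a = "ACDEFGHIKL" * 10
    long_b = long_a[:50] + "W" * 50
    # Both pairs are 50% identical; only the longer alignment passes the toy curve.
    print(percent_identity(short_a, short_b), passes_threshold(short_a, short_b))
    print(percent_identity(long_a, long_b), passes_threshold(long_a, long_b))
```

The curve is passed in as a parameter so that a real threshold function can be substituted without changing the rest of the sketch.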
