A unified statistical framework for sequence comparison and structure comparison.

We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. Moreover, the agreement between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be distantly related shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that very few pairs have significant similarity in terms of sequence but not structure, whereas many pairs have significant similarity in terms of structure but not sequence.
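The step of fitting a distribution to the observed scores and reading off a P value can be illustrated with a minimal sketch. It assumes the Gumbel (type I) form of the extreme-value distribution and a method-of-moments fit; the abstract does not state which fitting procedure was used, so the fitting method, the function names (`fit_gumbel_moments`, `p_value`, `sample_gumbel`) and the synthetic scores are all illustrative assumptions rather than the authors' implementation.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location and scale by the method of moments.

    Assumption: a moment-based fit stands in for whatever procedure
    the paper actually used.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    scale = sd * math.sqrt(6) / math.pi   # variance = (pi^2 / 6) * scale^2
    loc = mean - EULER_GAMMA * scale      # mean = loc + gamma * scale
    return loc, scale


def p_value(score, loc, scale):
    """Probability that a chance score is at least `score`."""
    z = (score - loc) / scale
    if z < -700:                          # avoid overflow in exp(-z)
        return 1.0
    # 1 - exp(-exp(-z)), computed with expm1 for accuracy in the tail
    return -math.expm1(-math.exp(-z))


def sample_gumbel(rng, loc, scale):
    """Draw one Gumbel variate by inverting the CDF (synthetic data only)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return loc - scale * math.log(-math.log(u))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical background scores standing in for an all-vs.-all run.
    background = [sample_gumbel(rng, 20.0, 5.0) for _ in range(20000)]
    loc, scale = fit_gumbel_moments(background)
    print(f"fitted loc={loc:.2f}, scale={scale:.2f}")
    print(f"P(score >= 50) = {p_value(50.0, loc, scale):.3g}")
```

In this sketch a large score yields a small P value because the upper tail of the fitted distribution decays roughly exponentially, which is the property that lets a single fitted curve assign significance to every comparison score.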
