Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.

Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197], together with their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA with ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those that are identified may be used with confidence.
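The reliability claim for E-values can be illustrated with a small check: an E-value threshold predicts roughly that many chance matches per database search, so the observed number of false positives per query at that threshold should be of the same order. The Python sketch below is a minimal illustration under stated assumptions, not the procedure used in the study. The function names `errors_per_query` and `coverage`, the input layout of (E-value, is-homolog) pairs, and the toy hit list are all hypothetical; in practice the homology labels would come from a structural classification such as SCOP.

```python
from typing import List, Tuple

# Each hit is (reported E-value, True if the pair is a known homolog).
Hit = Tuple[float, bool]


def errors_per_query(hits: List[Hit], n_queries: int, threshold: float) -> float:
    """Observed false positives per query among hits with E-value <= threshold."""
    false_pos = sum(1 for e, is_hom in hits if e <= threshold and not is_hom)
    return false_pos / n_queries


def coverage(hits: List[Hit], n_true_pairs: int, threshold: float) -> float:
    """Fraction of all known homologous pairs recovered at this threshold."""
    true_pos = sum(1 for e, is_hom in hits if e <= threshold and is_hom)
    return true_pos / n_true_pairs


# Toy data (made up for illustration): pooled hits from 4 queries,
# with 5 truly homologous pairs in the test set.
toy_hits: List[Hit] = [
    (1e-30, True), (1e-5, True), (0.02, True),
    (0.5, False), (3.0, True), (8.0, False), (9.0, False),
]

for t in (1.0, 10.0):
    epq = errors_per_query(toy_hits, n_queries=4, threshold=t)
    cov = coverage(toy_hits, n_true_pairs=5, threshold=t)
    print(f"E <= {t}: errors per query = {epq:.2f}, coverage = {cov:.2f}")
```

If the reported scores are well calibrated, the errors-per-query figure tracks the threshold as it is varied; a score that exaggerates significance would show many more observed errors than the threshold predicts.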
