A protein alignment scoring system sensitive at all evolutionary distances

Summary

Protein sequence alignments generally are constructed with the aid of a “substitution matrix” that specifies a score for aligning each pair of amino acids. Assuming a simple random protein model, it can be shown that any such matrix, when used for evaluating variable-length local alignments, is implicitly a “log-odds” matrix, with a specific probability distribution for amino acid pairs to which it is uniquely tailored. Given a model of protein evolution from which such distributions may be derived, a substitution matrix adapted to detecting relationships at any chosen evolutionary distance can be constructed. Because in a database search it generally is not known a priori what evolutionary distances will characterize the similarities found, it is necessary to employ an appropriate range of matrices in order not to overlook potential homologies. This paper formalizes this concept by defining a scoring system that is sensitive at all detectable evolutionary distances. The statistical behavior of this scoring system is analyzed, and it is shown that for a typical protein database search, estimating the originally unknown evolutionary distance appropriate to each alignment costs slightly over two bits of information, or somewhat less than a factor of five in statistical significance. A much greater cost may be incurred, however, if only a single substitution matrix, corresponding to the wrong evolutionary distance, is employed.
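
As a rough illustration of the two quantitative ideas in the summary, the sketch below builds a log-odds matrix from a set of target pair frequencies and converts a cost in bits into a multiplicative factor in significance. It is a toy under stated assumptions, not the paper's scoring system: the two-letter alphabet, the frequencies, and the function names `log_odds_matrix` and `bits_to_significance_factor` are all made up for illustration, and the conversion assumes the usual convention that a score difference of b bits changes the expected number of chance hits by a factor of 2^b.

```python
import math

def log_odds_matrix(target, background):
    """Return scores s_ij = log2(q_ij / (p_i * p_j)) in bits.

    target:     dict mapping (a, b) -> q_ab, the pair frequency expected
                in true alignments at some evolutionary distance
    background: dict mapping a -> p_a, the frequency of residue a
                in random sequences
    """
    return {
        (a, b): math.log2(q / (background[a] * background[b]))
        for (a, b), q in target.items()
    }

def bits_to_significance_factor(bits):
    """Convert a cost in bits to a multiplicative factor (assumes 2**bits)."""
    return 2.0 ** bits

# Hypothetical two-letter alphabet with illustrative frequencies.
background = {"A": 0.5, "B": 0.5}
target = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

scores = log_odds_matrix(target, background)

# Average score per aligned pair under the target distribution, in bits.
relative_entropy = sum(q * scores[pair] for pair, q in target.items())

if __name__ == "__main__":
    for pair, s in sorted(scores.items()):
        print(pair, round(s, 3))
    print("relative entropy (bits):", round(relative_entropy, 3))
    print("factor for 2 bits:", bits_to_significance_factor(2.0))
    print("bits for a factor of 5:", round(math.log2(5.0), 3))
```

Under this convention, exactly two bits corresponds to a factor of four and a factor of five to about 2.32 bits, which is consistent with the summary's statement that a cost slightly over two bits is somewhat less than a factor of five.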
