Dynamic programming algorithms for biological sequence comparison.

Efficient dynamic programming algorithms are available for a broad class of protein and DNA sequence comparison problems. These algorithms require computer time proportional to the product of the lengths of the two sequences being compared [O(N^2)] but require memory space proportional only to the sum of these lengths [O(N)]. Although the requirement for O(N^2) time limits use of the algorithms to the largest computers when searching protein and DNA sequence databases, many other applications of these algorithms, such as calculation of distances for evolutionary trees and comparison of a new sequence to a library of sequence profiles, are well within the capabilities of desktop computers. In particular, the results of library searches with rapid searching programs, such as FASTA or BLAST, should be confirmed by performing a rigorous optimal alignment. Whereas rapid methods do not overlook significant sequence similarities, FASTA limits the number of gaps that can be inserted into an alignment, so a rigorous calculation may extend the alignment substantially in some cases. BLAST does not allow gaps in the local regions that it reports; a calculation that allows gaps is very likely to extend the alignment substantially. Although a Monte Carlo evaluation of the statistical significance of a similarity score with a rigorous algorithm is much slower than the heuristic approach used by the RDF2 program, the dynamic programming approach should take less than 1 hour on a 386-based PC or desktop Unix workstation. For descriptive purposes, we have limited our discussion to methods for calculating similarity scores and distances that use gap penalties of the form g = rk, where k is the length of the gap. Nevertheless, programs for the more general case (g = q + rk) are readily available. Versions of these programs that run on Unix workstations, IBM-PC class computers, or the Macintosh can be obtained from either of the authors.
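As an illustration of the time and space bounds described above, the following is a minimal sketch, not the authors' programs, of a local similarity score computed with a gap penalty of the form g = rk. It keeps only one previous row of the dynamic programming matrix, so memory grows with the sequence length while time grows with the product of the two lengths. The function names (`local_similarity`, `shuffle_z_score`) and the scoring values (match = 1, mismatch = -1, r = 2) are assumptions chosen for the example. The second function sketches the general idea of a Monte Carlo significance estimate by shuffling one sequence; it is not a description of how RDF2 works.

```python
import random


def local_similarity(a, b, match=1, mismatch=-1, r=2):
    """Best local similarity score with gap penalty g = r*k.

    Time is O(len(a) * len(b)); memory is O(len(b)) because only the
    previous row of the matrix is retained. Scoring values are
    illustrative assumptions, not taken from the source.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] - r        # extend a gap by one residue (cost r)
            left = curr[j - 1] - r  # extend a gap in the other sequence
            curr[j] = max(0, diag, up, left)  # 0 restarts a local alignment
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best


def shuffle_z_score(a, b, trials=100, seed=0, **scoring):
    """Hypothetical Monte Carlo check: compare the observed score with
    scores obtained after shuffling b, and report a z-value."""
    rng = random.Random(seed)
    observed = local_similarity(a, b, **scoring)
    letters = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(local_similarity(a, "".join(letters), **scoring))
    mean = sum(scores) / trials
    var = sum((s - mean) ** 2 for s in scores) / (trials - 1)
    sd = var ** 0.5
    z = (observed - mean) / sd if sd > 0 else float("inf")
    return observed, z


if __name__ == "__main__":
    # One gap of length 1 joins two four-residue matches: 8 - 2 = 6.
    print(local_similarity("ACGTACGT", "ACGTTACGT"))
```

Each shuffled comparison costs as much as the original one, which is why a Monte Carlo evaluation with a rigorous algorithm is slow: `trials` full dynamic programming passes are needed per sequence pair.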
