Efficient algorithms for sequence analysis with concave and convex gap costs

We describe algorithms for two problems in sequence analysis: sequence alignment with gaps (multiple consecutive insertions and deletions treated as a unit) and RNA secondary structure with single loops only. We assume that the gap cost or loop cost is a convex or concave function of the length of the gap or loop, and show how this assumption can be used to develop efficient algorithms for these problems. We then show how the restriction to convex or concave functions can be relaxed, giving algorithms for cost functions that are neither convex nor concave but can be split into a small number of convex or concave pieces. Finally, we point out some sparsity in the structure of our sequence analysis problems and describe how to exploit it to speed up our algorithms further.
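To make the alignment problem concrete, the following is a minimal sketch of the straightforward dynamic program for alignment in which a gap of length k is charged as a single unit. It is a naive baseline, not the algorithm described here: it tries every possible gap length at every cell and so takes cubic time, and it makes no use of convexity or concavity. The names `align_cost`, `gap_cost`, and `sub_cost`, and the unit mismatch cost, are assumptions introduced for illustration only.

```python
import math

def align_cost(x, y, gap_cost, sub_cost=lambda a, b: 0 if a == b else 1):
    """Minimum cost of aligning x with y when a gap of length k costs gap_cost(k).

    Naive baseline: every cell scans all gap lengths ending there,
    giving O(m*n*(m+n)) time for sequences of lengths m and n.
    """
    m, n = len(x), len(y)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            # Match or substitute the last symbols of both prefixes.
            if i > 0 and j > 0:
                best = D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1])
            # End with a gap that deletes x[k:i] as one unit.
            for k in range(i):
                best = min(best, D[k][j] + gap_cost(i - k))
            # End with a gap that inserts y[k:j] as one unit.
            for k in range(j):
                best = min(best, D[i][k] + gap_cost(j - k))
            D[i][j] = best
    return D[m][n]

# Hypothetical gap cost that is a concave function of the gap length.
def log_gap(k):
    return 1.0 + math.log(k)
```

With `log_gap`, one gap of length 2 is cheaper than two separate gaps of length 1, which is the behaviour that treating a gap as a unit is meant to capture. With a linear cost such as `lambda k: k` and the default unit mismatch cost, the same recurrence reduces to ordinary edit distance.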
