An efficient string matching algorithm with k differences for nucleotide and amino acid sequences

Several algorithms have been designed to solve the problem of optimally aligning one sequence, the pattern, of length m, with another, longer sequence, the text, of length n. These algorithms allow mismatches, deletions and insertions. Algorithms to date run in O(mn) time. Let k be an integer giving the maximal number of differences allowed. We present a simple algorithm showing that sequences can be optimally aligned in O(k^2 n) time. For long sequences, the gain over the currently used algorithms, a factor on the order of m/k^2, is very large when k is small relative to m.
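To make the problem statement concrete, here is a minimal sketch of the O(mn) dynamic-programming baseline that the abstract compares against. It is not the O(k^2 n) algorithm presented in the paper. The function name `k_difference_ends` and its return format are assumptions chosen for illustration. It reports every text position at which some occurrence of the pattern ends with at most k differences (mismatches, insertions or deletions).

```python
def k_difference_ends(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (end_position, differences) pairs, with 1-based end positions,
    for every place in `text` where an occurrence of `pattern` ends with
    at most k differences. Baseline O(mn) time, O(m) space."""
    m = len(pattern)
    # prev[i] = fewest differences between pattern[:i] and any substring
    # of the text ending at the previous text position.
    prev = list(range(m + 1))  # before any text: i pattern characters unmatched
    hits = []
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            mismatch = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + mismatch,  # match or mismatch
                prev[i] + 1,             # extra character in the text
                cur[i - 1] + 1,          # pattern character left unmatched
            )
        if cur[m] <= k:
            hits.append((j, cur[m]))
        prev = cur
    return hits
```

For example, `k_difference_ends("ACGT", "TTACGTTT", 0)` returns `[(6, 0)]`, the single exact occurrence ending at text position 6. Every cell of the m-by-n table is filled regardless of k, which is the cost that an algorithm bounded by k aims to avoid.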
