An O(ND) difference algorithm and its variations

The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed, where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N + D^2) expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(N log N + D^2) time variation.
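The abstract does not spell out the algorithm, so the following is only a minimal sketch of the greedy, diagonal-by-diagonal search as it is commonly presented, not the paper's own listing. It assumes an edit script made of single-element insertions and deletions, and it returns only the script size D. The function name `shortest_edit_distance` and the dictionary `v` are illustrative choices, not names taken from the source.

```python
def shortest_edit_distance(a, b):
    """Return D, the number of insertions plus deletions needed to turn a into b.

    Sketch of a greedy search over the edit graph: for each candidate D,
    track the furthest-reaching point on every diagonal k = x - y.
    """
    n, m = len(a), len(b)
    max_d = n + m  # worst case: delete all of a, insert all of b

    # v[k] holds the largest x reached so far on diagonal k.
    # The seed v[1] = 0 lets the first step (d = 0, k = 0) start at x = 0.
    v = {1: 0}

    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: insert an element of b
            else:
                x = v[k - 1] + 1    # step right: delete an element of a
            y = x - k

            # Follow the diagonal for free while the elements match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1

            v[k] = x
            if x >= n and y >= m:
                return d            # reached the bottom-right corner

    return max_d
```

For example, `shortest_edit_distance("ABCABBA", "CBABAC")` returns 5. This agrees with the duality mentioned in the abstract: with N = 13 and a longest common subsequence of length L = 4, D = N - 2L = 5. The outer loop runs at most D + 1 times and the inner loop touches O(D) diagonals, which is where the D^2 term comes from, while the diagonal-following steps account for the remaining work.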
