Comparison of Strings Belonging to the Same Family

Abstract Strings belonging to the same family can be compared by constructing a long(est) Common Subsequence (CS) that reflects the structural similarities between them. The longest CS problem is NP-complete. In this paper, we present an $O(N^2 \times L^2 \log_2(L))$ algorithm, where N is the number of strings and L is the maximum string length, that constructs a CS for a family of strings. The CS is made up of the longer words that appear at approximately the same positions in all the strings. During each iteration, the algorithm looks for the longest common words that appear at approximately the same positions in all the strings, then filters them to keep only those that ensure the smallest norm between all the strings. The filtering is based on a new distance called the contextual distance.
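To make the search step concrete, the following is a minimal brute-force sketch of finding one longest word that occurs at approximately the same position in every string. It is not the paper's algorithm and does not reach its complexity bound. The function name `longest_common_word` and the `tolerance` parameter (the maximum allowed shift, in characters, relative to the first string) are assumptions introduced for illustration. The contextual-distance filtering is not sketched, because the abstract does not define it.

```python
def longest_common_word(strings, tolerance):
    """Return (word, positions) for a longest substring of strings[0] that
    occurs in every string at a start index within `tolerance` of its start
    index in strings[0]. Returns ("", []) if no such word exists.

    Illustrative brute force only; `tolerance` is a hypothetical parameter.
    """
    first = strings[0]
    shortest = min(len(s) for s in strings)
    # Try lengths from longest to shortest so the first hit is a longest word.
    for length in range(shortest, 0, -1):
        for start in range(len(first) - length + 1):
            word = first[start:start + length]
            positions = [start]
            for other in strings[1:]:
                # Window of admissible start indices in the other string.
                lo = max(0, start - tolerance)
                hi = min(len(other) - length, start + tolerance)
                pos = next(
                    (p for p in range(lo, hi + 1) if other.startswith(word, p)),
                    None,
                )
                if pos is None:
                    break  # word is missing from this string's window
                positions.append(pos)
            else:
                # The word was found in every string.
                return word, positions
    return "", []


family = ["abcHELLOdef", "zzHELLOqq", "pHELLOrrrr"]
word, positions = longest_common_word(family, tolerance=2)
print(word, positions)
```

An iterative method in the spirit of the abstract would repeat such a search on the parts of the strings to the left and right of each retained word. That recursion, and the choice among competing words, are left out here.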
