Faster algorithm of string comparison

Many applications need to determine how similar two fields are. This paper introduces a package of new substring-based algorithms for determining field similarity. Combined, the algorithms not only achieve higher accuracy but also attain a time complexity of O(knm) (k < 0.75) in the worst case, O(βn) with β < 6 in the average case, and O(1) in the best case. Throughout the paper, comparative examples show that our algorithms are more accurate than the one proposed by Lee et al. [1]. Theoretical analysis, concrete examples, and experimental results show that our algorithms significantly improve both the accuracy and the time complexity of computing field similarity.
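
The abstract does not spell out the algorithms themselves, so the following is only an illustrative sketch of what a substring-based field similarity could look like, not the authors' method. It assumes a greedy scheme: repeatedly take a longest common substring, count its length as matched, and mask it out. The names `longest_common_substring`, `field_similarity`, and the `min_len` threshold are hypothetical, as is the normalization by total length. The sketch also assumes the inputs never contain the characters `\x00` and `\x01`, which it uses as masks.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of one longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


def field_similarity(a, b, min_len=2):
    """Hypothetical similarity in [0, 1] based on greedy substring matching."""
    if a == b:
        return 1.0  # identical fields need no further work
    if not a or not b:
        return 0.0
    total = len(a) + len(b)
    matched = 0
    while True:
        i, j, n = longest_common_substring(a, b)
        if n < min_len:
            break
        matched += n
        # Mask the matched span with distinct placeholders so the pieces
        # on either side cannot join up into a spurious new match.
        a = a[:i] + "\x00" + a[i + n:]
        b = b[:j] + "\x01" + b[j + n:]
    return 2 * matched / total
```

Each pass of this sketch costs O(nm) for fields of lengths n and m, so it does not reproduce the complexity bounds claimed in the abstract; it only shows the general shape of a substring-based measure.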

[1]  S. Muthukrishnan, et al. Approximate nearest neighbors and sequence comparison with block operations, 2000, STOC '00.

[2]  Karl R. Abrahamson. Generalized String Matching, 1987, SIAM J. Comput.

[3]  Richard Cole, et al. Approximate string matching: a simpler faster algorithm, 2002, SODA '98.

[4]  Salvatore J. Stolfo, et al. The merge/purge problem for large databases, 1995, SIGMOD '95.

[5]  Howard J. Karloff. Fast Algorithms for Approximately Counting Mismatches, 1993, Inf. Process. Lett.

[6]  David J. DeWitt, et al. Duplicate record elimination in large data files, 1983, TODS.

[7]  Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[8]  Peter Weiner, et al. Linear Pattern Matching Algorithms, 1973, SWAT.

[9]  Richard M. Karp, et al. Efficient Randomized Pattern-Matching Algorithms, 1987, IBM J. Res. Dev.

[10]  Esko Ukkonen, et al. On-line construction of suffix trees, 1995, Algorithmica.

[11]  Hongjun Lu, et al. Cleansing Data for Mining and Warehousing, 1999, DEXA.

[12]  Charles Elkan, et al. The Field Matching Problem: Algorithms and Applications, 1996, KDD.

[13]  Michael J. Fischer, et al. The String-to-String Correction Problem, 1974, JACM.

[14]  Uzi Vishkin, et al. Communication complexity of document exchange, 1999, SODA '00.

[15]  Arnold L. Rosenberg, et al. Rapid identification of repeated patterns in strings, trees and arrays, 1972, STOC.

[16]  Stephen Alstrup, et al. Pattern matching in dynamic texts, 2000, SODA '00.

[17]  Uzi Vishkin, et al. Efficient approximate and dynamic matching of patterns using a labeling paradigm, 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[18]  Kurt Mehlhorn, et al. Maintaining dynamic sequences under equality tests in polylogarithmic time, 1994, SODA '94.

[19]  Gad M. Landau, et al. Introducing efficient parallelism into approximate string matching and a new serial algorithm, 1986, STOC '86.

[20]  Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[21]  Moshe Lewenstein, et al. Faster algorithms for string matching with k mismatches, 2000, SODA '00.

[22]  Uzi Vishkin, et al. Symmetry breaking for suffix tree construction, 1994, STOC '94.

[23]  Zvi Galil, et al. Open Problems in Stringology, 1985.