A Faster Algorithm for Approximate String Matching

We present a new algorithm for on-line approximate string matching. The algorithm simulates a non-deterministic finite automaton that is built from the pattern and takes the text as input. The simulation uses bit operations on a RAM machine with word length O(log n), where n is the maximum size of the text. The running time is O(n) for small patterns (i.e. of length m = O(√(log n))), independently of the maximum number of errors allowed, k. This algorithm is then used to design two general algorithms: one partitions the problem into subproblems, while the other partitions the automaton into sub-automata. These are combined into a hybrid algorithm whose average running time is O(n) for moderate k/m ratios, O(√(mk/log n) · n) for medium ratios, and O((m−k)kn/log n) for large ratios. We show experimentally that this hybrid algorithm is faster than previous ones for moderate pattern lengths and error ratios, which is the typical case in text searching.
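To make the idea of simulating the automaton with bit operations concrete, the following is a minimal sketch, not the arrangement proposed in this paper. It uses the simpler, classic row-wise scheme (Shift-And extended to k errors), keeping one bit vector per error count and updating all k+1 vectors for every text character. The function name `approx_search` is hypothetical, and Python integers stand in for machine words, so the sketch ignores the word-length limit that the analysis above depends on.

```python
def approx_search(pattern, text, k):
    """Return the 0-based end positions in `text` where `pattern`
    matches some substring with at most k errors (edit distance).

    Row-wise bit-parallel simulation of the approximate-matching NFA:
    bit j of rows[d] is set when the pattern prefix of length j+1
    matches a suffix of the text read so far with at most d errors.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1        # mask keeping only the m state bits
    accept = 1 << (m - 1)      # bit of the final state

    # masks[c] has bit j set when pattern[j] == c.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)

    # With d errors available, the first d pattern characters can be
    # skipped before reading any text.
    rows = [((1 << d) - 1) & full for d in range(k + 1)]

    ends = []
    for pos, ch in enumerate(text):
        b = masks.get(ch, 0)
        prev_old = rows[0]                      # row d-1 before this step
        rows[0] = ((rows[0] << 1) | 1) & b      # exact-match row
        for d in range(1, k + 1):
            cur_old = rows[d]
            match = ((cur_old << 1) | 1) & b    # pattern char equals text char
            insertion = prev_old                # extra character in the text
            # Substitution uses the old row d-1, deletion of a pattern
            # character uses the freshly updated row d-1.
            subst_or_del = ((prev_old | rows[d - 1]) << 1) | 1
            rows[d] = (match | insertion | subst_or_del) & full
            prev_old = cur_old
        if rows[k] & accept:
            ends.append(pos)
    return ends
```

For example, `approx_search("abc", "xxabdxx", 1)` returns `[3, 4]`: "ab" ends at position 3 (one deletion) and "abd" ends at position 4 (one substitution). Each text character costs O(k) word operations in this row-wise form, which is the dependence on k that the abstract says its O(n) bound for small patterns avoids.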
