A Unified View to String Matching Algorithms

We present a unified view of sequential algorithms for many pattern matching problems, based on a finite automaton that is built from the pattern and takes the text as input. We show the limitations of deterministic finite automata (DFA) and the advantages of a bitwise simulation of non-deterministic finite automata (NFA). This approach gives very fast practical algorithms with good complexity for small patterns on a RAM with word length O(log n), where n is the size of the text. For generalized string matching the time complexity is O(mn/log n), which is linear for small patterns. For approximate string matching we show that the two main known approaches to the problem are variations of the NFA simulation. For this case we present a different simulation technique that, for small patterns, runs in O(n) time independent of the maximum number of errors allowed, k. This algorithm improves on the best bitwise or comparison-based algorithms, which run in O(kn) time, and can be used as a building block for algorithms with good average-case behavior. We also formalize previous bitwise simulation of general NFAs, achieving O(mn log log n/log n) time.
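To illustrate what a bitwise NFA simulation looks like, the sketch below shows the classic Shift-And scheme with character classes, a simple form of generalized string matching. It is not the paper's own algorithm or code. The function name `shift_and_search` and the representation of the pattern as a sequence of allowed-character collections are assumptions made for this example. Bit i of `state` is set when the NFA state reached after matching pattern positions 0..i is active, so a single shift and AND updates all states at once for each text character.

```python
def shift_and_search(pattern, text):
    """Return start indices of all occurrences of pattern in text.

    pattern is a sequence whose items are the characters allowed at that
    position: a plain string works (one character per position), and so
    does a list of strings or sets (a character class per position).
    """
    m = len(pattern)
    if m == 0:
        return []

    # masks[c] has bit i set when character c is accepted at position i.
    masks = {}
    for i, allowed in enumerate(pattern):
        for c in allowed:
            masks[c] = masks.get(c, 0) | (1 << i)

    accept = 1 << (m - 1)  # bit of the final NFA state
    state = 0              # bit vector of active NFA states
    hits = []
    for j, c in enumerate(text):
        # Advance every active state by one position, re-activate the
        # initial state (the "| 1"), and keep only the states whose
        # pattern position accepts the current text character.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_and_search("aba", "ababa"))
    print(shift_and_search(["a", "bc", "d"], "xacdabd"))
```

Python integers are unbounded, so this sketch accepts any pattern length. The word-RAM bounds quoted in the abstract correspond to the case where the bit vector fits in a constant number of machine words, which is what "small patterns" refers to.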
