A new taxonomy of sublinear right-to-left scanning keyword pattern matching algorithms

A new taxonomy of sublinear (multiple) keyword pattern matching algorithms is presented. Based on an earlier taxonomy by the second and third authors, this new taxonomy includes not only suffix-based algorithms, but also factor- and factor-oracle-based algorithms. In particular, we show how suffix-based (Commentz-Walter-like), factor-based and factor-oracle-based sublinear keyword pattern matching algorithms can be seen as instantiations of a general sublinear algorithm skeleton. During processing, such algorithms shift (jump) through the text in a forward, left-to-right direction, and at each position they stop at, they read backward, right-to-left; that is, they read suffixes of certain prefixes of the text. They use finite automata to decide efficiently whether the string read so far belongs to a certain language (for example, the suffixes or the factors of the keywords). In addition, we show that the shift functions defined for the suffix-based algorithms can be reused for the factor- and factor-oracle-based algorithms. The taxonomy is built by deriving the algorithms from a common starting point, adding algorithm and problem details step by step to arrive at efficient or well-known algorithms. Such a presentation provides correctness arguments for the algorithms and makes clear how they are related to one another. It also helps in constructing a toolkit of the algorithms.
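The skeleton described above can be illustrated with a minimal Python sketch. It is not the derivation from the paper: the function names `build_shift` and `find_keywords` are hypothetical, a plain set of keyword suffixes stands in for the finite automaton, and the shift function is a simple Horspool-style bad-character shift chosen only because it is easy to verify as safe. Keywords are assumed to be non-empty.

```python
def build_shift(keywords):
    """Horspool-style shift table (an illustrative choice, not the paper's).

    shift[c] is the smallest k >= 1 such that some keyword has character c
    exactly k positions before its last character; characters absent from
    the table get the default shift lmin (the shortest keyword length).
    """
    lmin = min(len(p) for p in keywords)
    shift = {}
    for p in keywords:
        m = len(p)
        for k in range(1, m):
            c = p[m - 1 - k]
            shift[c] = min(shift.get(c, lmin), k)
    return shift, lmin


def find_keywords(keywords, text):
    """Report (start, keyword) for every keyword occurrence in text.

    j moves left to right through the text (the shifts); for each j the
    text is read right to left from j for as long as text[i:j] is still
    a suffix of some keyword.
    """
    # Stand-in for the automaton: membership in the set of keyword suffixes.
    suffixes = {p[i:] for p in keywords for i in range(len(p) + 1)}
    keyword_set = set(keywords)
    shift, lmin = build_shift(keywords)

    matches = []
    j = lmin  # no keyword can end before position lmin
    while j <= len(text):
        i = j
        # Read backward while the string read is a suffix of a keyword.
        while i > 0 and text[i - 1:j] in suffixes:
            i -= 1
            if text[i:j] in keyword_set:
                matches.append((i, text[i:j]))
        # Shift forward by a safe distance based on the last character read.
        j += shift.get(text[j - 1], lmin)
    return matches


print(find_keywords(["he", "she", "his", "hers"], "ushers"))
```

In this sketch, swapping the suffix set for a set of keyword factors (or a factor oracle) would change only the membership test in the inner loop, which is the sense in which the different algorithm families share one skeleton. Whether a given shift function stays safe under such a swap has to be argued separately.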
