Faster pattern matching with character classes using prime number encoding

In pattern matching with character classes, the goal is to find all occurrences of a pattern of length $m$ in a text of length $n$, where each pattern position specifies a set of allowed characters from a finite alphabet $\Sigma$. We present an FFT-based algorithm that uses a novel prime-number encoding scheme and is faster by a factor of $\log n / \log m$ than the fastest extant approaches, which are based on boolean convolutions. In particular, if $m^{|\Sigma|} = n^{O(1)}$, our algorithm runs in time $O(n \log m)$, matching the complexity of the fastest techniques for wildcard matching, a special case of our problem. A major advantage of our algorithm is that it allows a tradeoff between the running time and the RAM word size. Our algorithm also speeds up solutions to approximate matching problems with character classes, namely matching with $k$ mismatches and Hamming distance, as well as to the subset matching problem.
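To make the encoding idea concrete, the following is a minimal sketch of one way a prime-number encoding can reduce character-class matching to a single integer correlation. It is an illustration under stated assumptions, not necessarily the exact construction of the paper: the function names `primes_above` and `match_classes` are hypothetical, the primes are chosen larger than $m$, the pattern codes are built with the Chinese remainder theorem, and the correlation is computed naively in $O(nm)$ time with exact Python integers, where an FFT-based convolution would compute the same sums.

```python
from math import prod


def primes_above(lower, count):
    """Return the first `count` primes strictly greater than `lower` (trial division)."""
    found = []
    candidate = max(lower + 1, 2)
    while len(found) < count:
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            found.append(candidate)
        candidate += 1
    return found


def match_classes(text, pattern, alphabet):
    """Return all start positions where every text character lies in the
    character class at the corresponding pattern position.

    Assumptions of this sketch (not taken from the paper):
    - every character of `text` belongs to `alphabet`;
    - each alphabet character gets a distinct prime larger than m = len(pattern).
    """
    n, m = len(text), len(pattern)
    primes = dict(zip(alphabet, primes_above(m, len(alphabet))))
    Q = prod(primes.values())

    # Text code: Q / p_c, divisible by every prime except the character's own.
    text_code = [Q // primes[c] for c in text]

    def class_code(allowed):
        # CRT: the code is 0 mod p_c for allowed c and 1 mod p_c otherwise.
        code = 0
        for c, p in primes.items():
            if c not in allowed:
                cofactor = Q // p
                code += cofactor * pow(cofactor, -1, p)
        return code % Q

    pattern_code = [class_code(set(cls)) for cls in pattern]

    hits = []
    for i in range(n - m + 1):
        # One correlation value; an FFT would produce all of these at once.
        s = sum(pattern_code[j] * text_code[i + j] for j in range(m))
        # Each mismatching pair contributes Q / p_c (mod Q), each matching
        # pair contributes 0. Because fewer than p_c mismatches can involve
        # character c, the sum is 0 mod Q exactly when there is no mismatch.
        if s % Q == 0:
            hits.append(i)
    return hits


# Example: classes {A}, {C, G}, {G, T} over the DNA alphabet.
print(match_classes("ACGTACGA", ["A", "CG", "GT"], "ACGT"))  # [0, 4]
```

In this sketch all codes are bounded by $Q$, the product of $|\Sigma|$ primes slightly larger than $m$, so the numbers handled are roughly of size $m^{|\Sigma|}$. This is consistent with the condition $m^{|\Sigma|} = n^{O(1)}$ above, under which such values fit in $O(\log n)$ bits.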
