Faster Subsequence and Don't-Care Pattern Matching on Compressed Texts

Subsequence pattern matching problems on compressed text were first considered by Cegielski et al. (Window Subsequence Problems for Compressed Texts, Proc. CSR 2006, LNCS 3967, pp. 127-136), where the principal problem is: given a string T represented as a straight-line program (SLP) of size n and a pattern string P of length m, compute the number of minimal subsequence occurrences of P in T. We present an O(nm) time algorithm that solves all variations of the problem introduced by Cegielski et al. This improves on the previously best known algorithm of Tiskin (Towards approximate matching in compressed strings: Local subsequence recognition, Proc. CSR 2011), which runs in O(nm log m) time. We further show that our algorithms can be modified to solve a wider range of problems within the same O(nm) time bound, and present the first matching algorithms on SLP-compressed texts for patterns containing VLDC (variable-length don't-care) symbols, as well as for patterns containing FLDC (fixed-length don't-care) symbols.
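To make the problem statement concrete, the following is a minimal Python sketch of what "minimal subsequence occurrences" means on a small SLP. It is not the O(nm) algorithm of the paper: it simply decompresses the text and counts by brute force, which can take time exponential in n. The SLP encoding (a dict mapping each variable to either a single character or a pair of variables) and the names `expand`, `earliest_end`, and `count_minimal_occurrences` are assumptions made for illustration. A window T[i..j] is counted when P is a subsequence of T[i..j] but of neither T[i+1..j] nor T[i..j-1].

```python
def expand(slp, symbol):
    """Decompress one SLP variable into a plain string (illustrative only;
    the result can be exponentially longer than the SLP)."""
    rule = slp[symbol]
    if isinstance(rule, str):          # terminal rule: X -> 'a'
        return rule
    left, right = rule                 # binary rule: X -> Y Z
    return expand(slp, left) + expand(slp, right)


def earliest_end(text, pattern, start):
    """Smallest j such that pattern is a subsequence of text[start..j],
    or None if there is no such j. Assumes pattern is non-empty."""
    k = 0
    for j in range(start, len(text)):
        if text[j] == pattern[k]:
            k += 1
            if k == len(pattern):
                return j
    return None


def count_minimal_occurrences(text, pattern):
    """Count windows [i, j] that contain pattern as a subsequence and
    cannot be shrunk from either side."""
    count = 0
    for i in range(len(text)):
        end = earliest_end(text, pattern, i)
        if end is None:
            break                      # no occurrence starts at i or later
        nxt = earliest_end(text, pattern, i + 1)
        # [i, end] is minimal iff dropping text[i] pushes the end further right.
        if nxt is None or nxt > end:
            count += 1
    return count


# Hypothetical example SLP deriving the Fibonacci word "abaab".
slp = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),   # ab
    "X4": ("X3", "X2"),   # aba
    "X5": ("X4", "X3"),   # abaab
}
text = expand(slp, "X5")
result = count_minimal_occurrences(text, "ab")
```

Here the SLP has n = 5 rules and the pattern has m = 2 characters; the two minimal windows in "abaab" are the two literal occurrences of "ab". The point of the compressed-text setting is to obtain the same count without ever materializing `text`.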

[1] Abraham Lempel et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[2] Ayumi Shinohara et al. An Improved Pattern Matching Algorithm for Strings in Terms of Straight-Line Programs, 1997, CPM.

[3] Markus Lohrey et al. Querying and Embedding Compressed Texts, 2006, MFCS.

[4] Alexander Tiskin. Towards Approximate Matching in Compressed Strings: Local Subsequence Recognition, 2011, CSR.

[5] Abraham Lempel et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[6] A. Moffat et al. Offline dictionary-based compression, 2000, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[7] Wojciech Rytter et al. Compressed string-matching in standard Sturmian words, 2009, Theor. Comput. Sci.

[8] Ricardo A. Baeza-Yates et al. Searching Subsequences, 1991, Theor. Comput. Sci.

[9] Craig G. Nevill-Manning et al. Compression by induction of hierarchical grammars, 1994, Proceedings of IEEE Data Compression Conference (DCC'94).

[10] Gonzalo Navarro et al. Self-indexed Text Compression Using Straight-Line Programs, 2009, MFCS.

[11] Heikki Mannila et al. Discovery of Frequent Episodes in Event Sequences, 1997, Data Mining and Knowledge Discovery.

[12] Wojciech Rytter et al. Grammar Compression, LZ-Encodings, and String Algorithms with Implicit Input, 2004, ICALP.

[13] Gad M. Landau et al. A Unified Algorithm for Accelerating Edit-Distance Computation via Text-Compression, 2009, STACS.

[14] Yury Lifshits et al. Window Subsequence Problems for Compressed Texts, 2006, CSR.

[15] Donald E. Knuth et al. Fast Pattern Matching in Strings, 1977, SIAM J. Comput.

[16] Wojciech Rytter et al. An Efficient Pattern-Matching Algorithm for Strings with Short Descriptions, 1997, Nord. J. Comput.

[17] Alexander Tiskin. Faster subsequence recognition in compressed strings, 2007, ArXiv.