Speeding Up q-Gram Mining on Grammar-Based Compressed Texts

We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight-line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents a string $T$, the algorithm computes the occurrence frequencies of all q-grams in $T$ by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to q-grams. The reduced problem can be solved in time linear in $m$. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, which improves on our previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
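To make the problem setting concrete, the following is a minimal Python sketch of counting q-grams directly on an SLP without expanding it. It illustrates only the boundary-crossing idea behind $O(qn)$-style approaches: every occurrence of a q-gram (for $q \ge 2$) straddles the split point of exactly one node in the derivation tree, so it suffices to scan a window of at most $2(q-1)$ characters per rule and weight it by how often the rule occurs in the derivation tree. It is not the trie-based algorithm of this paper. The rule encoding (a list whose entries are either a single character or a pair of earlier rule indices, with the last rule as the start symbol) and the names `qgram_freqs_slp` and `expand` are assumptions made for illustration. Because each window is hashed as a Python string, this sketch does not achieve the linear-time bounds stated above.

```python
from collections import Counter

def qgram_freqs_slp(rules, q):
    """Count q-grams of the string derived by the last rule of an SLP.

    rules[i] is either a one-character string (terminal rule) or a pair
    (l, r) with l, r < i (rule i derives rules[l] followed by rules[r]).
    This encoding is a hypothetical one chosen for the sketch.
    """
    n = len(rules)
    # vocc[i]: number of occurrences of rule i in the derivation tree.
    vocc = [0] * n
    vocc[-1] = 1
    for i in range(n - 1, -1, -1):
        if isinstance(rules[i], tuple):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]

    k = q - 1
    pre = [""] * n  # prefix of length min(k, |expansion|) of each rule
    suf = [""] * n  # suffix of length min(k, |expansion|) of each rule
    freq = Counter()
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = rule[:k]
            suf[i] = rule[-k:] if k else ""
            if q == 1 and vocc[i]:
                freq[rule] += vocc[i]
        else:
            l, r = rule
            # q-grams crossing the boundary between the two children
            window = suf[l] + pre[r]
            if vocc[i]:
                for j in range(len(window) - q + 1):
                    freq[window[j:j + q]] += vocc[i]
            pre[i] = (pre[l] + pre[r])[:k]
            suf[i] = (suf[l] + suf[r])[-k:] if k else ""
    return freq

def expand(rules, i=-1):
    """Naive decompression, used only to cross-check small examples."""
    rule = rules[i]
    if isinstance(rule, str):
        return rule
    return expand(rules, rule[0]) + expand(rules, rule[1])
```

For example, the five rules `['a', 'b', (0, 1), (2, 2), (3, 3)]` derive `abababab`, and `qgram_freqs_slp(rules, 2)` can be compared against a direct count over `expand(rules)`.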
