Computing q-Gram Non-overlapping Frequencies on SLP Compressed Texts

Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the fields of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2 n)$ time and $O(qn)$ space, where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009), which solved the problem only for $q = 2$ in $O(n^4 \log n)$ time and $O(n^3)$ space.
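To make the problem statement concrete, the following is a minimal sketch of a naive baseline, not the algorithm of the paper: it fully decompresses the SLP and then counts, for every q-gram, the maximum number of pairwise non-overlapping occurrences by a greedy left-to-right scan. The SLP encoding (a dictionary mapping each variable either to a single character or to a pair of variables) and the function names `expand_slp` and `nonoverlapping_qgram_freq` are assumptions made for illustration only. Because the decompressed text can be exponentially longer than the SLP, this baseline does not achieve the bounds stated above.

```python
from collections import defaultdict


def expand_slp(rules, start):
    """Decompress an SLP into the string it derives (naive; may be huge).

    `rules` is a hypothetical encoding: each variable maps either to a
    single character (terminal rule) or to a pair (left, right) of variables.
    """
    memo = {}

    def derive(var):
        if var not in memo:
            rhs = rules[var]
            if isinstance(rhs, str):
                memo[var] = rhs
            else:
                memo[var] = derive(rhs[0]) + derive(rhs[1])
        return memo[var]

    return derive(start)


def nonoverlapping_qgram_freq(text, q):
    """Return {q-gram: maximum number of non-overlapping occurrences}.

    Greedy scan: an occurrence is counted only if it starts at or after
    the end of the previously counted occurrence of the same q-gram.
    Taking the leftmost available occurrence is optimal because all
    occurrences have the same length q.
    """
    counts = defaultdict(int)
    next_free = defaultdict(int)  # first position not covered by a counted occurrence
    for i in range(len(text) - q + 1):
        gram = text[i:i + q]
        if i >= next_free[gram]:
            counts[gram] += 1
            next_free[gram] = i + q
    return dict(counts)


# Example SLP deriving "aaaaa": X2 -> X1 X1, X3 -> X2 X2, X4 -> X3 X1.
rules = {"X1": "a", "X2": ("X1", "X1"), "X3": ("X2", "X2"), "X4": ("X3", "X1")}
text = expand_slp(rules, "X4")
freqs = nonoverlapping_qgram_freq(text, 2)
```

In this example the 2-gram "aa" occurs four times in "aaaaa" when overlaps are allowed, but only two of those occurrences can be chosen without overlapping, which is the quantity the problem asks for.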
