Finger Search in Grammar-Compressed Strings

Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the original uncompressed string, report the character at that position. In this paper we study the random access problem with the finger search property, that is, the time for a random access query should depend on the distance between a specified index f, called the finger, and the query index i. We consider both a static variant, where we first place a finger and subsequently access indices near the finger efficiently, and a dynamic variant, which additionally supports moving the finger in time that depends on the distance moved. Let n be the size of the grammar, and let N be the size of the string. For the static variant we give a linear-space representation that supports placing the finger in O(log N) time and subsequently accessing in O(log D) time, where D is the distance between the finger and the accessed index. For the dynamic variant we give a linear-space representation that supports placing the finger in O(log N) time and accessing and moving the finger in O(log D + log log N) time. Compared to the best linear-space solution to random access, we improve an O(log N) query bound to O(log D) for the static variant and to O(log D + log log N) for the dynamic variant, while maintaining linear space. As an application of our results we obtain an improved solution to the longest common extension problem in grammar-compressed strings. To obtain our results, we introduce several new techniques of independent interest, including a novel van Emde Boas-style decomposition of grammars.
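To make the problem statement concrete, the following is a minimal sketch of random access and a naive static finger on a grammar given as a straight-line program. It is not the data structure of this paper: the class `SLP`, its methods, and the example rules are hypothetical names chosen for illustration, and both queries take time proportional to the height of the grammar rather than O(log N) or O(log D). The sketch only shows what the operations compute and why remembering the path to the finger can shorten nearby accesses.

```python
from typing import Dict, List, Tuple, Union

# Assumption for illustration: a rule is either a single terminal
# character or a pair (left, right) of rule names.
Rule = Union[str, Tuple[str, str]]


class SLP:
    """Naive random access and static finger on a straight-line program."""

    def __init__(self, rules: Dict[str, Rule], start: str) -> None:
        self.rules = rules
        self.start = start
        self.size: Dict[str, int] = {}   # expansion length of each rule
        self.path: List[Tuple[str, int]] = []
        self._compute_size(start)

    def _compute_size(self, x: str) -> int:
        # Memoized recursion; adequate for small example grammars.
        if x not in self.size:
            rhs = self.rules[x]
            if isinstance(rhs, str):
                self.size[x] = 1
            else:
                self.size[x] = (self._compute_size(rhs[0])
                                + self._compute_size(rhs[1]))
        return self.size[x]

    def _check(self, i: int) -> None:
        if not 0 <= i < self.size[self.start]:
            raise IndexError(i)

    def _descend(self, x: str, i: int) -> str:
        # Return the character at offset i inside the expansion of x.
        while not isinstance(self.rules[x], str):
            left, right = self.rules[x]
            if i < self.size[left]:
                x = left
            else:
                i -= self.size[left]
                x = right
        return self.rules[x]

    def access(self, i: int) -> str:
        """Plain random access: walk from the start rule down to position i."""
        self._check(i)
        return self._descend(self.start, i)

    def set_finger(self, f: int) -> None:
        """Record the root-to-leaf path to position f as (rule, start offset)."""
        self._check(f)
        self.path = []
        x, start = self.start, 0
        while True:
            self.path.append((x, start))
            rhs = self.rules[x]
            if isinstance(rhs, str):
                break
            left, right = rhs
            if f - start < self.size[left]:
                x = left
            else:
                start += self.size[left]
                x = right

    def access_near_finger(self, i: int) -> str:
        """Climb the finger path to the lowest rule covering i, then descend."""
        self._check(i)
        depth = len(self.path) - 1
        while True:
            x, start = self.path[depth]
            if start <= i < start + self.size[x]:
                return self._descend(x, i - start)
            depth -= 1  # the start rule covers every index, so this stops


# Example grammar (hypothetical): X6 expands to a string of length N = 8.
rules: Dict[str, Rule] = {
    "X1": "b", "X2": "a",
    "X3": ("X2", "X1"), "X4": ("X3", "X2"),
    "X5": ("X4", "X3"), "X6": ("X5", "X4"),
}
slp = SLP(rules, "X6")
text = "".join(slp.access(i) for i in range(slp.size["X6"]))
slp.set_finger(5)
near = slp.access_near_finger(6)
```

In this sketch an access close to the finger often stops climbing after a few steps, but nothing bounds the climb or the descent by the distance D. Closing that gap while keeping linear space is the subject of the results stated above.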
