Grammar Index By Induced Suffix Sorting

Pattern matching is the most central task for text indices. Most recent indices leverage compression techniques to make pattern matching feasible for massive but highly compressible datasets. Among such indices, we propose a new compressed text index built upon a grammar compression based on induced suffix sorting [Nunes et al., DCC’18]. We show that this grammar exhibits a locality-sensitive parsing property, which allows us to specify, given a pattern P, certain substrings of P, called cores, whose occurrences are similarly parsed in the text grammar whenever these occurrences are extensible to occurrences of P. Supported by the cores, given a pattern of length m, we can locate all its occ occurrences in a text T of length n within O(m lg |S| + occ_C lg |S| lg n + occ) time, where S is the set of all characters and nonterminals, occ is the number of occurrences of P in T, and occ_C is the number of occurrences of a chosen core C of P in the right-hand sides of all production rules of the grammar of T. Our grammar index requires O(g) words of space and can be built in O(n) time using O(g) working space, where g is the sum of the lengths of the right-hand sides of all production rules. We underline the strength of our grammar index with an exhaustive practical evaluation that gives evidence that our proposed solution excels at locating long patterns in highly repetitive texts. Our implementation is available at https://github.com/TooruAkagi/GCIS_Index.
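To make the parsing idea more concrete, the following is a minimal sketch of a single parsing round in the spirit of induced suffix sorting. It classifies positions as S-type or L-type, cuts the text at leftmost-S (LMS) positions, and replaces each distinct factor by an integer name. It is not the authors' implementation. The function names (`suffix_types`, `lms_positions`, `factorize`, `one_round`) are hypothetical, and so are the choices to treat the last position as L-type without an explicit sentinel, to cut into non-overlapping factors, and to number factors by first appearance.

```python
def suffix_types(text):
    """Return 'S'/'L' types, one per position of text.

    Position i is S-type if the suffix starting at i is lexicographically
    smaller than the suffix starting at i+1, and L-type otherwise.
    Assumption of this sketch: no sentinel is appended, so the last
    position is L-type (its suffix is larger than the empty suffix).
    """
    n = len(text)
    if n == 0:
        return []
    types = ["L"] * n
    for i in range(n - 2, -1, -1):
        if text[i] < text[i + 1]:
            types[i] = "S"
        elif text[i] > text[i + 1]:
            types[i] = "L"
        else:
            types[i] = types[i + 1]  # equal characters inherit the type
    return types


def lms_positions(text):
    """Return positions i > 0 that are S-type with an L-type predecessor."""
    types = suffix_types(text)
    return [i for i in range(1, len(text))
            if types[i] == "S" and types[i - 1] == "L"]


def factorize(text):
    """Cut text into non-overlapping factors starting at LMS positions."""
    if not text:
        return []
    bounds = [0] + lms_positions(text) + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def one_round(text):
    """Replace each distinct factor by an integer name.

    Returns (rules, reduced): rules maps a factor to its name, and
    reduced is the sequence of names spelling the text. Names are given
    in order of first appearance, which is a simplification.
    """
    rules = {}
    reduced = []
    for factor in factorize(text):
        name = rules.setdefault(factor, len(rules))
        reduced.append(name)
    return rules, reduced
```

For example, `factorize("abracadabra")` cuts the text into `["abr", "ac", "ad", "abra"]`, and `one_round("abcabcabc")` maps the three equal factors to the same name. Because the cut positions depend only on a local character context, repeated substrings tend to receive the same factors away from their borders, which is the intuition behind the locality-sensitive parsing property stated above.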

[1]  I Tomohiro,et al.  Deterministic Sparse Suffix Sorting in the Restore Model , 2020, ACM Trans. Algorithms.

[2]  Hiroshi Sakamoto,et al.  ESP-index: A compressed index based on edit-sensitive parsing , 2011, J. Discrete Algorithms.

[3]  Uzi Vishkin,et al.  Efficient approximate and dynamic matching of patterns using a labeling paradigm , 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[4]  Giovanni Manzini,et al.  Opportunistic data structures with applications , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[5]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[6]  Jeffrey Shallit,et al.  Decision Algorithms for Fibonacci-Automatic Words, with Applications to Pattern Avoidance , 2014, ArXiv.

[7]  Dominik Kempa,et al.  At the roots of dictionary compression: string attractors , 2017, STOC.

[8]  Peter Elias,et al.  Universal codeword sets and representations of the integers , 1975, IEEE Trans. Inf. Theory.

[9]  Gonzalo Navarro,et al.  Optimal-Time Text Indexing in BWT-runs Bounded Space , 2017, SODA.

[10]  Gonzalo Navarro,et al.  Grammar-Compressed Indexes with Logarithmic Search Time , 2020, ArXiv.

[11]  Kunihiko Sadakane,et al.  Compression with the tudocomp Framework , 2017, SEA.

[12]  Gonzalo Navarro,et al.  Improved Grammar-Based Compressed Indexes , 2012, SPIRE.

[13]  Haim Kaplan,et al.  Linear-time pointer-machine algorithms for least common ancestors, MST verification, and dominators , 1998, STOC '98.

[14]  Hiroshi Sakamoto,et al.  ESP-index: A compressed index based on edit-sensitive parsing , 2013, J. Discrete Algorithms.

[15]  Donald E. Knuth,et al.  Fast Pattern Matching in Strings , 1977, SIAM J. Comput.

[16]  Gonzalo Navarro,et al.  A grammar compressor for collections of reads with applications to the construction of the BWT , 2020, 2021 Data Compression Conference (DCC).

[17]  Kurt Mehlhorn,et al.  Maintaining dynamic sequences under equality tests in polylogarithmic time , 1994, SODA '94.

[18]  Miguel A. Martínez-Prieto,et al.  Universal indexes for highly repetitive document collections , 2016, Inf. Syst..

[19]  Juha Kärkkäinen,et al.  A Faster Grammar-Based Self-index , 2011, LATA.

[20]  Gonzalo Navarro,et al.  A Grammar Compression Algorithm Based on Induced Suffix Sorting , 2018, 2018 Data Compression Conference.

[21]  Sen Zhang,et al.  Two Efficient Algorithms for Linear Time Suffix Array Construction , 2011, IEEE Transactions on Computers.

[22]  Kunihiko Sadakane,et al.  Practical Entropy-Compressed Rank/Select Dictionary , 2006, ALENEX.

[23]  Hiroshi Sakamoto,et al.  siEDM: an efficient string index and search algorithm for edit distance with moves , 2016, Algorithms.

[24]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[25]  Gonzalo Navarro,et al.  DACs: Bringing direct access to variable-length codes , 2013, Inf. Process. Manag..

[26]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[27]  Graham Cormode,et al.  The string edit distance matching problem with moves , 2002, SODA '02.

[28]  Peter Elias,et al.  Efficient Storage and Retrieval by Content and Address of Static Files , 1974, JACM.

[29]  Rodrigo González,et al.  Compressed text indexes: From theory to practice , 2007, JEAL.

[30]  Gonzalo Navarro,et al.  Self-Indexed Grammar-Based Compression , 2011, Fundam. Informaticae.

[31]  En-Hui Yang,et al.  Grammar-based codes: A new class of universal lossless source codes , 2000, IEEE Trans. Inf. Theory.

[32]  Gonzalo Navarro,et al.  Optimal-Time Dictionary-Compressed Indexes , 2018, ACM Trans. Algorithms.

[33]  S. Muthukrishnan,et al.  On the sorting-complexity of suffix tree construction , 2000, JACM.

[34]  Hideo Bannai,et al.  Dynamic Index and LZ Factorization in Compressed Space , 2016, Stringology.

[35]  Kazuya Tsuruta,et al.  Grammar-compressed Self-index with Lyndon Words , 2020, ArXiv.

[36]  Hiroshi Sakamoto,et al.  Improved ESP-index: A Practical Self-index for Highly Repetitive Texts , 2014, SEA.

[37]  Hiroshi Sakamoto,et al.  Rpair: Rescaling RePair with Rsync , 2019, SPIRE.

[38]  Gonzalo Navarro,et al.  Grammar Compression By Induced Suffix Sorting , 2020, ArXiv.

[39]  A. Moffat,et al.  Offline dictionary-based compression , 2000, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[40]  Dan Gusfield.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[41]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[42]  Artur Jez,et al.  Approximation of grammar-based compression via recompression , 2013, Theor. Comput. Sci..