A compressed dynamic self-index for highly repetitive text collections

We present a novel compressed dynamic self-index for highly repetitive text collections. Signature encoding, an existing compressed dynamic self-index for highly repetitive texts, has a major drawback: pattern search is slow for short patterns. We address this drawback by leveraging the idea behind the truncated suffix tree, and present the TST-index, the first compressed dynamic self-index for highly repetitive texts that supports both fast pattern search and dynamic updates of the index. Experiments on a benchmark dataset of highly repetitive texts show that the TST-index significantly improves pattern search performance.
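To give a rough feel for why truncation helps with short patterns, the following minimal sketch builds a suffix trie cut off at depth q and answers locate queries for patterns of length at most q by a single root-to-node walk. It is only an illustration of the general truncated-suffix-tree idea, not the TST-index itself: the structure here is static and uncompressed, and the names `build_truncated_suffix_trie`, `locate_short`, and the parameter `q` are hypothetical choices made for this example.

```python
def build_truncated_suffix_trie(text: str, q: int) -> dict:
    """Build a suffix trie of `text` truncated at depth q.

    Each node stores the start positions of the substring it spells.
    Assumption for illustration: plain nested dicts, no compression.
    """
    root = {"children": {}, "positions": []}
    for i in range(len(text)):
        node = root
        # Insert only the first q characters of the suffix starting at i.
        for ch in text[i:i + q]:
            node = node["children"].setdefault(
                ch, {"children": {}, "positions": []}
            )
            node["positions"].append(i)
    return root


def locate_short(root: dict, pattern: str, q: int) -> list[int]:
    """Return all start positions of `pattern`, for 1 <= len(pattern) <= q."""
    if not 1 <= len(pattern) <= q:
        raise ValueError("pattern length must be between 1 and q")
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # pattern does not occur in the text
    return node["positions"]


trie = build_truncated_suffix_trie("abracadabra", q=3)
occurrences = locate_short(trie, "abr", q=3)
```

Patterns longer than q fall outside this toy structure; in the setting described above, such patterns would be handled by the underlying self-index rather than by the truncated structure.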
