Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation

Given a static reference string R and a source string S, a relative compression of S with respect to R is an encoding of S as a sequence of references to substrings of R. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing massive, highly repetitive data sets such as genomes and web data. We initiate the study of relative compression in a dynamic setting where the compressed source string S is subject to edit operations. The goal is to maintain the compressed representation compactly while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem, and we present new optimal or near-optimal bounds for both. Plugging in our new results, we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
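
To make the setting concrete, the following is a minimal sketch of a relative compression with random access and a single-character replace. It is an illustration only, not the data structures of the paper: the names Fenwick, RelativeCompression, access, replace and decode are hypothetical, partial sums over the factor lengths are kept in a plain Fenwick tree (logarithmic time rather than the optimal bounds), the tree is naively rebuilt after every edit, and factors are never merged, so the representation can drift away from an optimal compression.

```python
class Fenwick:
    """Partial sums over a fixed-length array of positive integers."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values):
            self.add(i, v)

    def add(self, i, delta):
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def search(self, t):
        """Return (k, t - sum(values[:k])) for the largest k with sum(values[:k]) <= t."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= t:
                pos = nxt
                t -= self.tree[nxt]
            step >>= 1
        return pos, t


class RelativeCompression:
    """S stored as a list of (start, length) references into the static string R."""

    def __init__(self, reference, factors):
        self.R = reference
        self.factors = list(factors)
        self._rebuild()

    def _rebuild(self):
        # Assumption of this sketch: rebuild from scratch after each edit.
        # A dynamic partial sums structure would support insertions directly.
        lengths = [length for _, length in self.factors]
        self.size = sum(lengths)
        self.sums = Fenwick(lengths)

    def access(self, i):
        """Return S[i] without decompressing S."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        k, off = self.sums.search(i)
        start, _ = self.factors[k]
        return self.R[start + off]

    def replace(self, i, ch):
        """Set S[i] = ch by splitting the factor that covers position i."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        pos = self.R.find(ch)
        if pos < 0:
            raise ValueError("character does not occur in R")
        k, off = self.sums.search(i)
        start, length = self.factors[k]
        pieces = []
        if off > 0:
            pieces.append((start, off))
        pieces.append((pos, 1))
        if off + 1 < length:
            pieces.append((start + off + 1, length - off - 1))
        self.factors[k:k + 1] = pieces
        self._rebuild()

    def decode(self):
        return "".join(self.R[s:s + l] for s, l in self.factors)


rc = RelativeCompression("ACGTACGGT", [(4, 5), (0, 3)])
print(rc.decode(), rc.access(5))
rc.replace(2, "T")
print(rc.decode(), rc.factors)
```

The lookup in access is a partial sums search over the factor lengths, which is why edits that change the number of factors call for a dynamic partial sums structure supporting insertions and deletions. Re-merging adjacent factors after an edit amounts to asking whether the concatenation of two substrings of R again occurs in R, which is the kind of query the substring concatenation problem addresses.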

[8]  Philip Bille,et al.  Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation , 2016, ISAAC.
