Dynamic Relative Compression

Given a static reference string R and a source string S, a relative compression of S with respect to R is an encoding of S as a sequence of references to substrings of R. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly repetitive massive data sets such as genomes and web data. We initiate the study of relative compression in a dynamic setting where the compressed source string S is subject to edit operations. The goal is to maintain the compressed representation compactly while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted or extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem, and we present new optimal or near-optimal bounds for both. Plugging in our new results, we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
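
To make the definition concrete, the following is a minimal static sketch in Python, not the data structure from the paper. It encodes S as (start, length) references into R by naive greedy longest matching and then answers random-access queries on S through prefix sums of the factor lengths. The function names `greedy_factorize` and `access`, and the example strings, are assumptions chosen for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def greedy_factorize(R, S):
    """Encode S as (start, length) references into R, longest match first.

    Naive quadratic search for illustration only; assumes every character
    of S occurs somewhere in R.
    """
    factors = []
    i = 0
    while i < len(S):
        best_start, best_len = -1, 0
        for start in range(len(R)):
            length = 0
            while (start + length < len(R) and i + length < len(S)
                   and R[start + length] == S[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len == 0:
            raise ValueError(f"character {S[i]!r} does not occur in R")
        factors.append((best_start, best_len))
        i += best_len
    return factors


def access(R, factors, ends, i):
    """Return S[i] using only R, the factors, and the prefix sums `ends`."""
    k = bisect_right(ends, i)  # index of the factor covering position i
    offset = i - (ends[k - 1] if k > 0 else 0)
    return R[factors[k][0] + offset]


R = "ACGTACGGA"  # hypothetical reference string
S = "ACGGACGTA"  # hypothetical source string
factors = greedy_factorize(R, S)
ends = list(accumulate(length for _, length in factors))
decoded = "".join(access(R, factors, ends, i) for i in range(len(S)))
```

In this static sketch the prefix sums in `ends` are computed once. Under edits to S, factor lengths change and every later prefix sum shifts, so a dynamic version would need a structure that supports both updates and prefix-sum searches efficiently; this is one way to read the abstract's interest in the dynamic partial sums problem.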
