Implementing Efficient Updates in Compressed Big Text Databases

Text compression techniques such as bzip2 do not allow strings to be inserted into or deleted from a compressed text at a given position without first decompressing it. We present a technique called DICIRT that supports fast insertion into and deletion from compressed texts without full decompression. For inserted fragments of up to 8% of the original text size, and for deleted fragments of up to 15% of the original text size, DICIRT is faster than the conventional approach of decompressing the text, modifying the uncompressed text, and compressing the result again.
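For orientation, the conventional approach that DICIRT is measured against (decompress, modify, recompress) can be sketched with Python's standard `bz2` module. This is only an illustration of that baseline, not of DICIRT itself, whose internals are not described here; the function names `insert_baseline` and `delete_baseline` are hypothetical.

```python
import bz2


def insert_baseline(compressed: bytes, pos: int, fragment: bytes) -> bytes:
    """Insert `fragment` at byte offset `pos` via full decompression.

    Hypothetical helper illustrating the baseline, not DICIRT.
    """
    text = bz2.decompress(compressed)  # full decompression
    if not 0 <= pos <= len(text):
        raise ValueError("insert position out of range")
    modified = text[:pos] + fragment + text[pos:]
    return bz2.compress(modified)  # full recompression


def delete_baseline(compressed: bytes, pos: int, length: int) -> bytes:
    """Delete `length` bytes starting at byte offset `pos` via full decompression.

    Hypothetical helper illustrating the baseline, not DICIRT.
    """
    text = bz2.decompress(compressed)  # full decompression
    if pos < 0 or length < 0 or pos + length > len(text):
        raise ValueError("delete range out of bounds")
    modified = text[:pos] + text[pos + length:]
    return bz2.compress(modified)  # full recompression
```

Both helpers process the entire text on every update, regardless of how small the inserted or deleted fragment is. That full decompress-and-recompress cost is what a technique working on the compressed text aims to avoid.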
