A Framework of Dynamic Data Structures for String Processing

In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM-indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that these predictions closely match empirical measurements. To conclude, we turn our attention to applications. We compare the performance of four recently published compression algorithms implemented using DYNAMIC with that of state-of-the-art tools performing the same tasks. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks.
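
To make the term "searchable partial sums" concrete, the sketch below shows the three operations such a structure typically offers: a point update, a prefix sum, and a search for the first position whose prefix sum reaches a given value. It is only an illustration under stated assumptions: the class name PartialSums and its methods are hypothetical and are not DYNAMIC's API, and a fixed-length, uncompressed Fenwick tree is swapped in for brevity, so it supports neither insertion of new elements nor compression.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of searchable partial sums.
// NOT the DYNAMIC library's API: fixed length, uncompressed Fenwick tree.
class PartialSums {
public:
    // Create n elements, all initially zero.
    explicit PartialSums(std::size_t n) : tree_(n + 1, 0) {}

    std::size_t size() const { return tree_.size() - 1; }

    // Add delta to the element at 0-based index i.
    void update(std::size_t i, std::uint64_t delta) {
        for (std::size_t k = i + 1; k < tree_.size(); k += lowbit(k))
            tree_[k] += delta;
    }

    // Sum of the elements at indices 0..i (inclusive).
    std::uint64_t sum(std::size_t i) const {
        std::uint64_t s = 0;
        for (std::size_t k = i + 1; k > 0; k -= lowbit(k))
            s += tree_[k];
        return s;
    }

    // Smallest 0-based index i with sum(i) >= x, or size() if the
    // total is smaller than x. Descends the implicit tree bit by bit.
    std::size_t search(std::uint64_t x) const {
        std::size_t pos = 0;
        std::size_t step = 1;
        while (step * 2 <= size()) step *= 2;
        for (; step > 0; step /= 2) {
            std::size_t next = pos + step;
            if (next <= size() && tree_[next] < x) {
                pos = next;        // the first `next` elements sum to less than x
                x -= tree_[next];  // keep searching for the remainder
            }
        }
        return pos;
    }

private:
    // Lowest set bit of k (two's complement trick on unsigned values).
    static std::size_t lowbit(std::size_t k) { return k & (~k + 1); }

    std::vector<std::uint64_t> tree_;
};
```

For instance, after storing the values 3, 0, 2, 5, 1 with update, sum(2) is 5 and search(4) returns index 2, the first position at which the running total reaches 4. All three operations take time logarithmic in the number of elements.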
