LZ77-Like Compression with Fast Random Access

We introduce an alternative Lempel-Ziv text parsing, LZ-End, that converges to the entropy and in practice gets very close to LZ77. LZ-End forces sources to finish at the end of a previous phrase. Most Lempel-Ziv parsings can decompress the text only from the beginning. LZ-End is the only parsing we know of that can decompress arbitrary phrases in optimal time while staying closely competitive with LZ77, especially on highly repetitive collections, where LZ77 excels. Thus LZ-End is ideal as a compression format for highly repetitive sequence databases, where access to individual sequences is required, and it also opens the door to compressed indexing schemes for such collections.
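To make the constraint concrete, the following is a minimal sketch of an LZ-End-style parse and of substring extraction, not the authors' implementation. It assumes each phrase is stored as a hypothetical triple (source phrase index, copy length, trailing character), where the copied part is the last `length` characters of the text ending where the source phrase ends. The parser is a naive brute-force search and the extractor is plainly recursive, so neither reflects the optimal-time bounds mentioned above; the names `lzend_parse` and `extract` are illustrative only.

```python
from bisect import bisect_right


def lzend_parse(text):
    """Naive LZ-End-style parse (brute force, for illustration only).

    Returns (phrases, ends). phrases[k] = (src, length, char): the phrase
    copies the last `length` characters of text[:ends[src]] and then
    appends `char`. ends[k] is the exclusive end position of phrase k.
    """
    phrases, ends = [], []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        max_len = n - i - 1  # always leave one explicit trailing character
        for q, e in enumerate(ends):
            # Longest l such that text[i:i+l] is a suffix of text[:e].
            for l in range(min(e, max_len), best_len, -1):
                if text[e - l:e] == text[i:i + l]:
                    best_len, best_src = l, q
                    break
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
        ends.append(i)
    return phrases, ends


def extract(phrases, ends, lo, hi):
    """Return text[lo:hi] using only the parse, never the original text."""
    if lo >= hi:
        return ""
    k = bisect_right(ends, hi - 1)  # phrase containing position hi - 1
    start = ends[k - 1] if k > 0 else 0
    if lo < start:
        # The range crosses a phrase boundary: split it there.
        return extract(phrases, ends, lo, start) + extract(phrases, ends, start, hi)
    src, length, ch = phrases[k]
    tail = ""
    if hi == ends[k]:  # the range includes the explicit trailing character
        tail, hi = ch, hi - 1
    if lo >= hi:
        return tail
    # Map the copied part onto its source, which ends at a phrase boundary.
    shift = (ends[src] - length) - start
    return extract(phrases, ends, lo + shift, hi + shift) + tail


if __name__ == "__main__":
    sample = "alabar_a_la_alabarda$"
    parsed, boundaries = lzend_parse(sample)
    print(parsed)
    print(extract(parsed, boundaries, 12, 20))
```

Because every source ends exactly at an earlier phrase boundary, the extractor needs only the sorted array of phrase ends to locate a source, which is what makes access to a range possible without decompressing from the beginning. The recursion depth grows with the length of copy chains, so a practical version would be iterative.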
