Move-to-Front, Distance Coding, and Inversion Frequencies Revisited

Move-to-Front, Distance Coding, and Inversion Frequencies are three related techniques used to process the output of the Burrows-Wheeler Transform. In this paper we analyze how effectively these techniques compress low-entropy strings, that is, strings which have many regularities and are therefore highly compressible. This is a non-trivial task, since many compressors have non-constant overheads that become non-negligible when the input string is highly compressible. Because of the properties of the Burrows-Wheeler Transform, being locally optimal ensures that an algorithm compresses low-entropy strings effectively. Informally, local optimality implies that an algorithm is able to effectively compress an arbitrary partition of the input string. We show that, in their original formulations, none of Move-to-Front, Distance Coding, and Inversion Frequencies is locally optimal. We then describe simple variants of these algorithms which are locally optimal. To achieve local optimality with Move-to-Front it suffices to combine it with Run Length Encoding. To achieve local optimality with Distance Coding and Inversion Frequencies we use a novel "escape and re-enter" strategy. Since we build on previous results, our analyses are simple and shed new light on the inner workings of the three techniques considered in this paper.
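
As a rough illustration of what combining Move-to-Front with Run Length Encoding can look like, the following Python sketch applies Move-to-Front to a string and then run-length encodes the resulting ranks. It is only a minimal sketch under stated assumptions: the function names (`mtf_encode`, `mtf_decode`, `rle_encode`, `rle_decode`), the fixed initial symbol table, and the order of the two steps are illustrative choices, not necessarily the variant analyzed in the paper, and no integer coding of the output is performed.

```python
def mtf_encode(text, alphabet):
    """Replace each symbol by its current position in a recency-ordered table."""
    table = list(alphabet)          # assumed initial ordering of the table
    ranks = []
    for ch in text:
        r = table.index(ch)         # rank of the symbol in the table
        ranks.append(r)
        table.insert(0, table.pop(r))  # move the symbol to the front
    return ranks


def mtf_decode(ranks, alphabet):
    """Invert mtf_encode by replaying the same table updates."""
    table = list(alphabet)
    out = []
    for r in ranks:
        ch = table.pop(r)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


def rle_encode(seq):
    """Collapse each maximal run of equal values into a (value, length) pair."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1
        else:
            runs.append([x, 1])
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Expand (value, length) pairs back into the original sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


if __name__ == "__main__":
    text = "aaaabbbbcccc"           # a highly regular, low-entropy input
    ranks = mtf_encode(text, "abc")
    runs = rle_encode(ranks)
    print(ranks)
    print(runs)
    assert mtf_decode(rle_decode(runs), "abc") == text
```

On a highly regular input, Move-to-Front alone still emits one rank per input symbol, mostly zeros, whereas the run-length step collapses each run of equal ranks into a single pair. This gives some intuition for why a per-symbol overhead matters on low-entropy strings.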
