Most Burrows-Wheeler Based Compressors Are Not Optimal

We present a technique for proving lower bounds on the compression ratio of algorithms based on the Burrows-Wheeler Transform (BWT). We study three well-known BWT-based compressors: the original algorithm suggested by Burrows and Wheeler, BWT with distance coding, and BWT with run-length encoding. For each compressor, we exhibit a Markov source such that, for asymptotically large text generated by the source, the compression ratio divided by the entropy of the source is a constant greater than 1. This constant is 2 - ε, 1.26, and 1.29 for the three compressors, respectively. Our technique is robust and can be used to prove similar claims for most BWT-based compressors (with a few notable exceptions). This stands in contrast to statistical compressors and Lempel-Ziv-style dictionary compressors, which have long been known to be optimal, in the sense that for any Markov source, the compression ratio divided by the entropy of the source asymptotically tends to 1. We experimentally corroborate our theoretical bounds. Furthermore, we compare BWT-based compressors to other compressors and show that for "realistic" Markov sources they indeed perform poorly, often worse than other compressors. This is in contrast with the well-known fact that on English text, BWT-based compressors are superior to many other types of compressors.
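For readers unfamiliar with the pipeline being analyzed, the following is a minimal illustrative sketch of its first two stages as commonly described: the BWT followed by move-to-front coding. It is not the paper's implementation. The function names, the sentinel character, and the naive rotation sort are assumptions made for brevity; practical compressors build the transform with suffix arrays and follow it with an entropy coder.

```python
from __future__ import annotations


def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations.

    Illustration only: sorting all rotations is quadratic in space.
    The sentinel (an assumption of this sketch) marks the end of the
    text and must not occur in the input.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def move_to_front(data: str) -> list[int]:
    """Move-to-front coding: emit each symbol's current list position,
    then move that symbol to the front of the list."""
    alphabet = sorted(set(data))
    out = []
    for ch in data:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return out


if __name__ == "__main__":
    transformed = bwt("banana", sentinel="$")
    print(transformed)
    print(move_to_front(transformed))
```

The BWT tends to group equal symbols together, so move-to-front yields many small integers, which a subsequent entropy coder can encode compactly. The lower bounds in the abstract concern how far the output of such pipelines can remain from the source entropy.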
