Revisiting bounded context block‐sorting transformations

The Burrows–Wheeler Transform (BWT) produces a permutation of a string X, denoted X∗, by sorting the n cyclic rotations of X into full lexicographical order and taking the last column of the resulting n×n matrix to be X∗. The transformation is reversible in O(n) time. In this paper, we consider an alteration to the process, called k‐BWT, where rotations are only sorted to a depth k. We propose new approaches to the forward and reverse transform, and show that the methods are efficient in practice. More than a decade ago, two algorithms were independently discovered for reversing k‐BWT, both of which run in O(nk) time. Two recent algorithms have lowered the bounds for the reverse transformation to O(n log k) and O(n), respectively. We examine the practical performance of these reversal algorithms. We find that the original O(nk) approach is the most efficient in practice, and we investigate new approaches, aimed at further speeding up reversal, which store precomputed context boundaries in the compressed file. By explicitly encoding the context boundaries, we present an O(n) reversal technique that is both efficient and effective. Finally, our study elucidates an inherently cache‐friendly, and hitherto unobserved, behavior in the reverse k‐BWT, which could lead to new applications of the k‐BWT transform. In contrast to previous empirical studies, we show that the partial transform can be reversed significantly faster than the full transform, without significantly affecting compression effectiveness. Copyright © 2011 John Wiley & Sons, Ltd.
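To make the difference between the full and the depth-limited sort concrete, the following is a minimal sketch of both forward transforms. It is a naive illustration only, not one of the algorithms proposed or evaluated in the paper: the function names `bwt` and `k_bwt` are hypothetical, rotations are compared by materializing their prefixes, and ties among rotations that share the same first k symbols are assumed to be broken by original position (a stable sort).

```python
def bwt(x: str) -> str:
    """Full BWT: sort all cyclic rotations, return the last column."""
    n = len(x)
    # Rotation i starts at position i; its last symbol is x[i - 1]
    # (for i == 0 this wraps around to the final symbol of x).
    order = sorted(range(n), key=lambda i: x[i:] + x[:i])
    return "".join(x[i - 1] for i in order)


def k_bwt(x: str, k: int) -> str:
    """Depth-k transform: sort rotations by their first k symbols only.

    Assumption: rotations with equal k-symbol prefixes keep their
    original relative order, which Python's stable sort provides.
    """
    n = len(x)
    k = min(k, n)      # sorting deeper than n cannot change the order
    doubled = x + x    # lets us slice a rotation prefix without wrapping
    order = sorted(range(n), key=lambda i: doubled[i:i + k])
    return "".join(x[i - 1] for i in order)


if __name__ == "__main__":
    print(bwt("banana"))       # nnbaaa
    print(k_bwt("banana", 1))  # bnnaaa
    print(k_bwt("banana", 2))  # nbnaaa
```

With k equal to the string length, this sketch produces the same output as the full transform. Smaller values of k leave runs of rotations in their original order within each k-symbol context. A reversible scheme would also need to record where the original string ends up in the sorted order, which this sketch omits.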
