Antisequential suffix sorting for BWT-based data compression

Suffix sorting requires ordering the suffixes that begin at each symbol of an input sequence, and has applications in running queries on large texts and in universal lossless data compression based on the Burrows-Wheeler transform (BWT). We propose a new suffix lists data structure that leads to three fast, antisequential, and memory-efficient algorithms for suffix sorting. For a length-N input over a size-|X| alphabet, the worst-case complexities of these algorithms are Θ(N²), O(|X|N log(N/|X|)), and O(N√|X| log(N/|X|)), respectively. Furthermore, simulation results indicate performance that is competitive with other suffix sorting methods. In contrast, the suffix sorting methods that are fastest on standard test corpora have poor worst-case performance. Therefore, in comparison with other suffix sorting methods, suffix lists offer a useful trade-off between practical performance and worst-case behavior. Another distinguishing feature of suffix lists is that the algorithms are simple; some of them can be implemented in VLSI. Such an implementation could accelerate suffix sorting by at least an order of magnitude and enable high-speed BWT-based compression systems.
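To make the relationship between suffix sorting and the BWT concrete, here is a minimal Python sketch. It is not the suffix lists method proposed here; it is the naive approach of sorting suffixes by direct comparison, which has poor worst-case behavior on repetitive inputs. The function names `suffix_array` and `bwt`, and the use of `$` as an end-of-input sentinel, are assumptions made for illustration.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive baseline: each comparison may scan O(N) symbols, so this is
    only suitable for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """Compute the BWT of `text` from its sorted suffixes.

    Assumes `sentinel` does not occur in `text` and sorts before every
    symbol of `text` (an illustrative convention, not taken from the paper).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # The BWT output symbol for each sorted suffix is the symbol that
    # precedes it in s; index -1 wraps around to the sentinel.
    return "".join(s[i - 1] for i in suffix_array(s))


if __name__ == "__main__":
    print(suffix_array("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
    print(bwt("banana"))            # annb$aa
```

Once the suffixes are sorted, the BWT is a single linear pass over the sorted positions, which is why the cost of a BWT-based compressor is dominated by the suffix sorting step.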
