A New Class of Searchable and Provably Highly Compressible String Transformations

The Burrows-Wheeler Transform (BWT) is a string transformation that plays a fundamental role in the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation beyond the domain of strings. However, efforts to find non-trivial alternatives to the original, now 25 years old, Burrows-Wheeler string transformation have met with limited success. In this paper we bring new life to this area by introducing a whole new family of transformations that have all the virtues of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear-time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context-adaptive alphabet orderings, a concept introduced here. This more general class also includes the Alternating BWT, another invertible string transformation recently introduced in connection with a generalization of Lyndon words.
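As a point of reference for the three properties listed above, the following is a minimal sketch of the classic BWT only, not of the new family introduced in the paper. It assumes a sentinel character `$` that does not occur in the text and sorts before every text character. The function names (`bwt`, `inverse_bwt`, `count_occurrences`) are hypothetical, and the construction and rank steps are deliberately naive: practical implementations reach linear time with suffix-array construction and rank data structures, which this sketch does not attempt.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text+sentinel, keep the last column.

    Assumes the sentinel is absent from text and smaller than all its characters.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the BWT with the LF-mapping (last column to first column)."""
    n = len(last)
    # A stable sort of positions by character lists, for each row of the
    # first column, the position in the last column holding the same symbol.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for first_row, last_pos in enumerate(order):
        lf[last_pos] = first_row
    # Row 0 is the rotation starting with the sentinel, so its last
    # character is the final character of the text; walk backwards from it.
    out = []
    row = 0
    for _ in range(n - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))


def count_occurrences(last, pattern):
    """Backward search: count occurrences of pattern using only the BWT."""
    first = sorted(last)
    # starts[c] = number of characters in the text strictly smaller than c.
    starts = {}
    for idx, c in enumerate(first):
        starts.setdefault(c, idx)

    def rank(c, i):
        # Occurrences of c in last[:i]; naive O(n) stand-in for a rank structure.
        return last[:i].count(c)

    lo, hi = 0, len(last)
    for c in reversed(pattern):
        if c not in starts:
            return 0
        lo = starts[c] + rank(c, lo)
        hi = starts[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
print(transformed)                             # annb$aa
print(inverse_bwt(transformed))                # banana
print(count_occurrences(transformed, "ana"))   # 2
```

Note how the transformed string groups equal characters into runs (`nn`, `aa`), which is the clustering effect that makes BWT output compressible, and how the search touches only the transformed string, one pattern character per step.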
