Measuring the clustering effect of BWT via RLE

Abstract The Burrows–Wheeler Transform (BWT) is a reversible transformation that underlies several text compressors and many other tools used in Bioinformatics and Computational Biology. The BWT is not itself a compressor, but a transformation that performs a context-dependent permutation of the letters of the input text. This permutation often creates runs of equal letters (clusters) longer than those in the original text, a phenomenon usually referred to as the “clustering effect” of the BWT. From a combinatorial point of view, great attention has been given to the case in which the BWT produces the fewest clusters (cf. [5], [16], [21], [23]). In this paper we are concerned with the cases in which the clustering effect of the BWT is not achieved. For this purpose we introduce a complexity measure that counts the number of equal-letter runs of a word. This measure highlights that there exist many words for which the BWT gives an “un-clustering effect”, that is, the BWT produces a great number of short clusters. More generally, we show that applying the BWT to any word at worst doubles the number of equal-letter runs. Moreover, we prove that this bound is tight by exhibiting families of words for which the upper bound is always reached. We also prove that, for binary words, the case in which the BWT produces the maximal number of clusters is related to the well-known Artin's conjecture on primitive roots. The study of the combinatorial properties underlying this transformation could be useful for improving indexing and compression strategies.
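The run-counting measure described above can be tried out on small words. The following is a minimal sketch, not the authors' implementation: it assumes the BWT is taken as the last column of the lexicographically sorted cyclic rotations of the word (no end-of-string marker), and the names `bwt`, `runs`, and `run_ratio` are illustrative choices that do not come from the paper.

```python
from itertools import groupby


def bwt(word):
    """Last column of the sorted cyclic rotations of `word`.

    Assumption: no end-of-string marker is appended.
    """
    rotations = sorted(word[i:] + word[:i] for i in range(len(word)))
    return "".join(rotation[-1] for rotation in rotations)


def runs(word):
    """Number of maximal equal-letter runs in `word`."""
    return sum(1 for _ in groupby(word))


def run_ratio(word):
    """Runs after the BWT divided by runs before it (hypothetical helper)."""
    return runs(bwt(word)) / runs(word)


if __name__ == "__main__":
    for w in ("banana", "aabb"):
        print(w, bwt(w), runs(w), runs(bwt(w)), run_ratio(w))
```

Under these assumptions, `banana` maps to `nnbaaa` (6 runs reduced to 3, a clustering effect), while `aabb` maps to `baba` (2 runs increased to 4, an un-clustering effect that reaches the doubling bound stated in the abstract).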

[1] Antonio Restivo, et al. Burrows-Wheeler transform and Sturmian words, 2003, Inf. Process. Lett.

[2] Giovanni Manzini, et al. An analysis of the Burrows-Wheeler transform, 2001, SODA '99.

[3] M. Bringer. Sur un problème de R. Queneau, 1969.

[4] Antonio Restivo, et al. Burrows-Wheeler transform and palindromic richness, 2009, Theor. Comput. Sci.

[5] Simon J. Puglisi, et al. Words with Simple Burrows-Wheeler Transforms, 2008, Electron. J. Comb.

[6] H. Fredricksen. A Survey of Full Length Nonlinear Shift Register Cycle Algorithms, 1982.

[7] M. Lothaire. Algebraic Combinatorics on Words, 2002.

[8] Luca Q. Zamboni, et al. Clustering Words and Interval Exchanges, 2013.

[9] Gonzalo Navarro, et al. Compressed Compact Suffix Arrays, 2004, CPM.

[10] Peter R. J. Asveld, et al. Permuting operations on strings and their relation to prime numbers, 2010, Discret. Appl. Math.

[11] Gonzalo Navarro, et al. Storage and Retrieval of Highly Repetitive Sequence Collections, 2010, J. Comput. Biol.

[12] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[13] Antonio Restivo, et al. Balancing and clustering of words in the Burrows-Wheeler transform, 2011, Theor. Comput. Sci.

[14] Giovanni Manzini, et al. Opportunistic data structures with applications, 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[15] Haim Kaplan, et al. A Simpler Analysis of Burrows-Wheeler Based Compression, 2006, CPM.

[16] Samir Guglani, et al. LINGUISTIC COMPETENCE AND PSYCHOPATHOLOGY : A CROSS-CULTURAL MODEL, 1982, Indian journal of psychiatry.

[17] Jouni Sirén, et al. Compressed Full-Text Indexes for Highly Repetitive Collections, 2012.

[18] D. J. Wheeler, et al. A Block-sorting Lossless Data Compression Algorithm, 1994.

[19] Haim Kaplan, et al. Most Burrows-Wheeler Based Compressors Are Not Optimal, 2007, CPM.

[20] Peter M. Higgins. Burrows-Wheeler transformations and de Bruijn words, 2012, Theor. Comput. Sci.

[21] Robert E. Tarjan, et al. A Locally Adaptive Data, 1986.

[22] C. Hooley. On Artin's conjecture, 1967.

[23] Raffaele Giancarlo, et al. Boosting textual compression in optimal linear time, 2005, JACM.