A four-stage algorithm for updating a Burrows-Wheeler transform

We present a four-stage algorithm that updates the Burrows-Wheeler Transform of a text T when this text is modified. The Burrows-Wheeler Transform is used by many text compression applications and some self-index data structures. It operates by reordering the letters of a text T to obtain a new text bwt(T) which can be better compressed. Even though recent advances are giving this structure new applications, a major bottleneck remains: bwt(T) has to be entirely reconstructed from scratch whenever T is modified. We study how standard edit operations (insertion, deletion, or substitution of a letter or a factor) that transform a text T into T' impact bwt(T). We then present an algorithm that directly converts bwt(T) into bwt(T'). Based on this algorithm, we also sketch a method for converting the suffix array of T into the suffix array of T'. Finally, we show, based on the experiments we conducted, that this algorithm, whose worst-case time complexity is O(|T| log |T| (1 + log σ / log log |T|)), performs well in practice and compares favorably with the traditional approach of rebuilding bwt(T') from scratch.
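
To make the objects in the abstract concrete, the following minimal Python sketch computes bwt(T) in the naive way, by sorting all cyclic rotations, and again from a suffix array. It is not the four-stage update algorithm itself; it only illustrates the from-scratch construction that such an update is meant to avoid. The function names and the choice of "$" as an end-of-text sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort every cyclic rotation of text + sentinel
    and read off the last column. Quadratic space, for illustration only."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def suffix_array(s: str) -> list[int]:
    """Naive suffix array: start positions of the suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_suffix_array(text: str, sentinel: str = "$") -> str:
    """Same transform, obtained from the suffix array: the BWT letter for a
    suffix starting at i is the letter that precedes it (cyclically)."""
    s = text + sentinel
    return "".join(s[i - 1] for i in suffix_array(s))


if __name__ == "__main__":
    print(bwt("banana"))   # annb$aa
    # After a one-letter insertion, the naive approach recomputes everything.
    print(bwt("bandana"))
```

With a unique sentinel that sorts before every other letter, both functions return the same string, which is why an update procedure for bwt(T) also suggests one for the suffix array of T.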
