A Faster Implementation of Online Run-Length Burrows-Wheeler Transform

Run-length encoding of Burrows-Wheeler transformed strings, which yields the Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive strings. We propose a new algorithm for online RLBWT construction that works in run-compressed space: it runs in \(O(n\lg r)\) time and uses \(O(r\lg n)\) bits of space, where \(n\) is the length of the input string \(S\) received so far and \(r\) is the number of runs in the BWT of the reversed \(S\). Our algorithm improves on the state-of-the-art algorithm for online RLBWT in terms of empirical construction time. By adopting a dynamic list that maintains a total order, we replace rank queries on a dynamic wavelet tree over a run-length compressed string with direct comparisons of labels in the dynamic list. Empirical results on various benchmarks show the efficiency of our algorithm, especially for highly repetitive strings.
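To make the quantity \(r\) concrete, the following is a minimal offline sketch in Python that builds a BWT by sorting rotations and then run-length encodes it. It is not the online algorithm proposed here, and it does not work in run-compressed space. The function names `bwt`, `run_length_encode`, and `rlbwt`, the sentinel `$`, and the choice to transform the string as given rather than its reverse are all assumptions made for illustration.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations (illustration only, not online).

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of `text`.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rlbwt(text):
    """Run-length encoded BWT; the number of pairs is the run count r."""
    return run_length_encode(bwt(text))


if __name__ == "__main__":
    print(bwt("banana"))          # annb$aa
    print(rlbwt("banana"))        # five runs
    print(len(rlbwt("ab" * 50)))  # a highly repetitive input gives few runs
```

For the highly repetitive input `"ab" * 50`, the sketch yields only three runs for a string of length 100, which illustrates why space bounds stated in terms of \(r\) rather than \(n\) matter for such inputs.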