Online LZ77 Parsing and Matching Statistics with RLBWTs

Lempel-Ziv 1977 (LZ77) parsing, matching statistics and the Burrows-Wheeler Transform (BWT) are all fundamental elements of stringology. In a series of recent papers, Policriti and Prezza (DCC 2016 and Algorithmica, CPM 2017) showed how to use an augmented run-length compressed BWT (RLBWT) of the reverse $T^R$ of a text $T$ to compute the LZ77 parse of $T$ offline in $O(n \log r)$ time and $O(r)$ space, where $n$ is the length of $T$ and $r$ is the number of runs in the BWT of $T^R$. In this paper we first extend a well-known technique for updating an unaugmented RLBWT when a character is prepended to a text so that it works with Policriti and Prezza's augmented RLBWT. This immediately implies that we can build the LZ77 parse of $T$ online while still using $O(n \log r)$ time and $O(r)$ space; the extension also seems likely to be of independent interest. Our experiments, using an extension of Ohno, Takabatake, I and Sakamoto's (IWOCA 2017) implementation of updating, show that our approach is both time- and space-efficient for repetitive strings. We then show how to augment the RLBWT further (albeit making it static again and increasing its space by a factor proportional to the size of the alphabet) such that later, given another string $S$ and $O(\log \log n)$-time random access to $T$, we can compute the matching statistics of $S$ with respect to $T$ in $O(|S| \log \log n)$ time.
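To make the three objects named in the abstract concrete, the following is a minimal brute-force sketch of their definitions. It is not the RLBWT-based method of the paper and runs in polynomial rather than $O(n \log r)$ time. The function names, the `$` sentinel, and the particular LZ77 variant (a literal character when there is no earlier occurrence, otherwise a source/length pair with overlaps allowed) are assumptions made for illustration and may differ from the conventions used in the paper.

```python
def bwt(text, sentinel="$"):
    """BWT of text + sentinel via sorted rotations (illustration only).

    Assumes the sentinel does not occur in text and sorts before every
    character of text.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse s into (character, run length) pairs; r is the pair count."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def lz77_parse(text):
    """Greedy LZ77 parse (one common variant, assumed here).

    Each phrase is either a single literal character with no earlier
    occurrence, or a (source, length) pair naming the longest match that
    starts at an earlier position; the match may overlap the phrase.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            phrases.append(text[i])
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def matching_statistics(s, t):
    """MS[i] = length of the longest prefix of s[i:] that occurs in t."""
    ms = []
    for i in range(len(s)):
        length = 0
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    T = "banana"
    runs = run_length_encode(bwt(T[::-1]))  # runs in the BWT of T^R
    print(len(runs), runs)
    print(lz77_parse("abababa"))
    print(matching_statistics("nab", T))
```

For $T$ = `banana`, the BWT of $T^R$ with the sentinel appended is `bnn$aaa`, which has $r = 4$ runs; the parse of `abababa` consists of two literals followed by one self-overlapping phrase of length 5; and the matching statistics of `nab` with respect to `banana` are 2, 1, 1.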

[1] Hiroshi Sakamoto, et al. A Faster Implementation of Online Run-Length Burrows-Wheeler Transform, 2017, IWOCA.

[2] Fabio Cunial, et al. Indexed Matching Statistics and Shortest Unique Substrings, 2014, SPIRE.

[3] Markus Lohrey, et al. Algorithmics on SLP-compressed strings: A survey, 2012, Groups Complex. Cryptol.

[4] Alberto Policriti, et al. Fast Online Lempel-Ziv Factorization in Compressed Space, 2015, SPIRE.

[5] Justin Zobel, et al. Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval, 2010, SPIRE.

[6] Paolo Ferragina, et al. Indexing compressed text, 2005, JACM.

[7] Alberto Policriti, et al. From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back, 2017, CPM.

[8] Fabio Cunial, et al. Fast matching statistics in small space, 2018, SEA.

[9] Juha Kärkkäinen, et al. Lightweight Lempel-Ziv Parsing, 2013, SEA.

[10] Veli Mäkinen, et al. CHIC: a short read aligner for pan-genomic references, 2017, bioRxiv.

[11] Alberto Policriti, et al. LZ77 Computation Based on the Run-Length Encoded BWT, 2018, Algorithmica.

[12] Enno Ohlebusch, et al. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction, 2013.

[13] Fabio Cunial, et al. Representing the suffix tree with the CDAWG, 2017, CPM.

[14] D. J. Wheeler, et al. A Block-sorting Lossless Data Compression Algorithm, 1994.

[15] Simon J. Puglisi, et al. RLZAP: Relative Lempel-Ziv with Adaptive Pointers, 2016, SPIRE.

[16] Gonzalo Navarro, et al. Faster Compressed Suffix Trees for Repetitive Collections, 2016, ACM J. Exp. Algorithmics.

[17] Gonzalo Navarro, et al. Storage and Retrieval of Highly Repetitive Sequence Collections, 2010, J. Comput. Biol.

[18] Gonzalo Navarro, et al. Optimal-Time Text Indexing in BWT-runs Bounded Space, 2017, SODA.

[19] Richard Durbin, et al. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT), 2014, Bioinform.

[20] Alberto Policriti, et al. Computing LZ77 in Run-Compressed Space, 2015, 2016 Data Compression Conference (DCC).

[21] Stella M. Hurtley, et al. Storage and Retrieval, 2011.