Optimal-Time Queries on BWT-Runs Compressed Indexes

Although a significant number of compressed indexes for highly repetitive strings have been proposed thus far, developing compressed indexes that support faster queries remains a challenge. The run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression method that applies a reversible permutation to an input string followed by run-length encoding, and it has become a popular research topic in string processing. The r-index [Gagie et al., ACM'20] is an efficient compressed index on the RLBWT whose space usage depends not on the string length but on the number of runs in the RLBWT, and it supports locate queries in optimal time with $\omega(r)$ words of space, where $r$ is the number of runs in the RLBWT of the input string. Following this line of research, we present \emph{r-index-f}, the first compressed index on the RLBWT that supports various queries, including locate, count, and extract queries, decompression, and prefix search, in optimal time with a smaller working space of $O(r)$ words for small alphabets. We present efficient data structures for computing two important functions, LF and $\phi^{-1}$, in constant time with $O(r)$ words of space, which is a big step forward in computation time from the previous best result of $O(\log \log n)$ time for string length $n$ and $O(r)$ words of space. Finally, we present algorithms that leverage these two data structures to answer queries on the RLBWT in optimal time with $O(r)$ words of space.
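To make the objects named in the abstract concrete, the following is a minimal illustrative sketch, not the data structures of this work: it builds the BWT naively from sorted rotations, run-length encodes it to obtain the number of runs $r$, and tabulates the LF function explicitly in $O(n)$ space. The function names and the sentinel character `$` (assumed to be absent from the input and smaller than every input character) are assumptions made for illustration.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: last column of the sorted rotations of text + sentinel."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse s into (character, run length) pairs; r is the list length."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def lf_mapping(L):
    """LF[i] = C[L[i]] + (occurrences of L[i] in L[0..i)), stored explicitly."""
    counts = {}
    for ch in L:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c] = number of characters in L strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    seen, lf = {}, []
    for ch in L:
        lf.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return lf


def invert_bwt(L):
    """Recover the text by following LF from row 0 (the sentinel's rotation)."""
    lf = lf_mapping(L)
    i, out = 0, []
    for _ in range(len(L) - 1):
        out.append(L[i])  # characters come out from last to first
        i = lf[i]
    return "".join(reversed(out))


if __name__ == "__main__":
    L = bwt("banana")
    runs = run_length_encode(L)
    print(L, runs, "r =", len(runs))
    print(invert_bwt(L))
```

This sketch stores one LF entry per text position, so it uses $O(n)$ words; the point of the abstract is that the same function can be supported in constant time within $O(r)$ words.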
