Optimal-Time Queries on BWT-runs Compressed Indexes

Although many compressed indexes for highly repetitive strings have been proposed, developing compressed indexes that support faster queries remains a challenge. The run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression method that applies a reversible permutation to an input string followed by run-length encoding, and it has become a popular research topic in string processing. The r-index [Gagie et al., ACM'20] is an efficient compressed index on the RLBWT whose space usage depends not on the string length but on the number of runs in the RLBWT, and it supports locate queries in optimal time with $\omega(r)$ words, where $r$ is the number of runs in the RLBWT of the input string. Following this line of research, we present the first compressed index on the RLBWT, which we call \emph{r-index-f}, that supports various queries, including locate, count, and extract queries, decompression, and prefix search, in optimal time with a smaller working space of $O(r)$ words for small alphabets. We present efficient data structures for computing two important functions, LF and $\phi^{-1}$, in constant time with $O(r)$ words of space, which is a big step forward in computation time from the previous best result of $O(\log \log n)$ time for string length $n$ and $O(r)$ words of space. Finally, we present algorithms for computing queries on the RLBWT in optimal time with $O(r)$ words of space by leveraging those two data structures.
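To make the terms concrete, the following is a minimal Python sketch of the RLBWT and the LF function on a toy string. It is a naive illustration and not the data structure of the paper: it sorts all rotations, stores LF as a plain array of $n$ integers, and takes linear time and space. The function names and the sentinel character `$` (assumed to be smaller than every character of the input) are illustrative choices.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    Assumes the sentinel does not occur in text and sorts before all of
    its characters.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def lf_mapping(last_column):
    """LF(i) = C[c] + rank_c(i), where c = last_column[i].

    C[c] counts characters smaller than c, and rank_c(i) counts
    occurrences of c strictly before position i.
    """
    counts = {}
    for ch in last_column:
        counts[ch] = counts.get(ch, 0) + 1
    smaller = {}
    total = 0
    for ch in sorted(counts):
        smaller[ch] = total
        total += counts[ch]
    seen = {}
    lf = []
    for ch in last_column:
        lf.append(smaller[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return lf


def invert_bwt(last_column):
    """Recover the text by following LF from the row that starts with the sentinel."""
    lf = lf_mapping(last_column)
    i = 0  # row 0 is the rotation beginning with the sentinel
    out = []
    for _ in range(len(last_column) - 1):
        out.append(last_column[i])  # characters come out last to first
        i = lf[i]
    return "".join(reversed(out))


L = bwt("banana")
runs = run_length_encode(L)
print(L, runs, len(runs), invert_bwt(L))
```

For `banana`, the BWT has 7 characters but only $r = 5$ runs; on highly repetitive inputs the gap between $n$ and $r$ is what makes indexing in $O(r)$ words attractive.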
