Improved and extended locating functionality on compressed suffix arrays

Compressed Suffix Arrays (CSAs) offer the same functionality as classical suffix arrays (SAs), and more, within space close to that of the compressed text; in addition, they can reproduce any text fragment. Furthermore, their pattern search times are comparable to those of SAs. This combination has made CSAs extremely successful substitutes for SAs in space-demanding applications. Their weakest point is that they are orders of magnitude slower than SAs when reporting the precise positions of pattern occurrences. SAs have other well-known shortcomings, inherited by CSAs, such as retrieving those positions in arbitrary order. In this paper we present new techniques that, on the one hand, improve the current space/time tradeoffs for locating pattern occurrences on CSAs and, on the other, efficiently support extended pattern locating functionalities, such as reporting occurrences in text order or limiting the occurrences to those within a text window. Our experimental results show considerable savings with respect to the baseline techniques.
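
To make the "arbitrary order" shortcoming concrete, the following minimal sketch uses a plain, uncompressed suffix array. It is not the technique proposed in the paper, only a naive baseline written for illustration. All function names (`build_suffix_array`, `sa_range`, `locate`, `locate_in_window`) are hypothetical, and the quadratic construction is an assumption made for brevity.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text, sa, pattern):
    # Two binary searches give the half-open range [start, end) of
    # suffix-array entries whose suffixes begin with the pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def locate(text, sa, pattern):
    # Positions come out in suffix (lexicographic) order, not text order.
    start, end = sa_range(text, sa, pattern)
    return sa[start:end]


def locate_in_window(text, sa, pattern, left, right):
    # Naive baseline: locate everything, keep the occurrences that lie
    # fully inside text[left:right], then sort them into text order.
    m = len(pattern)
    hits = locate(text, sa, pattern)
    return sorted(p for p in hits if left <= p and p + m <= right)


text = "abracadabra"
sa = build_suffix_array(text)
print(locate(text, sa, "a"))                   # suffix order
print(sorted(locate(text, sa, "a")))           # text order, after sorting
print(locate_in_window(text, sa, "a", 3, 8))   # restricted to a window
```

In this baseline, text-order and windowed reporting both pay for locating and then sorting or filtering every occurrence. That extra cost is the kind of overhead the extended locating functionality described above is meant to reduce.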
