Locally Compressed Suffix Arrays

We introduce a compression technique for suffix arrays. It is sensitive to the compressibility of the text and is local, meaning that arbitrary portions of the suffix array can be decompressed by accessing mostly contiguous memory areas. This makes decompression very fast, especially when several contiguous cells must be accessed. Our main technical contributions are the following. First, we show that the runs of consecutive values known to appear in the function Ψ(i) = A^{-1}[A[i] + 1] of suffix arrays A of compressible texts also show up as repetitions in the differential suffix array A'[i] = A[i] − A[i−1]. Second, we use Re-Pair, a grammar-based compressor, to compress the differential suffix array, and we upper-bound its compression ratio in terms of the number of runs. Third, we show how to reduce the space used by the grammar rules by up to 50% while still permitting direct access to the rules. Fourth, we develop specific variants of Re-Pair that exploit knowledge of Ψ and use much less space than the general Re-Pair compressor, while achieving almost the same compression ratios. Fifth, we implement the scheme and compare it exhaustively with previous work, including the first implementations of previous theoretical proposals.
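
The first contribution can be checked on a toy input. The following is a minimal sketch, not the paper's implementation: it assumes a text terminated by a unique smallest symbol `$`, builds the suffix array naively, and uses hypothetical helper names (`suffix_array`, `psi_function`, `differential`, `count_matching_runs`). Inside a run, where Ψ(j) = Ψ(j−1) + 1, it verifies that A'[Ψ(j)] = A'[j], which follows from A[Ψ(j)] = A[j] + 1.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; only suitable for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi_function(sa):
    # Psi(i) = A^{-1}[A[i] + 1], wrapping around at the end of the text.
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    return [inverse[(pos + 1) % n] for pos in sa]


def differential(sa):
    # A'[i] = A[i] - A[i-1]; as a convention of this sketch, A'[0] = A[0].
    return [sa[0]] + [sa[i] - sa[i - 1] for i in range(1, len(sa))]


def count_matching_runs(text):
    # Count positions j inside a Psi run and check that the differential
    # value at j reappears at Psi(j). Cells holding the last text position
    # are skipped because Psi wraps around there.
    sa = suffix_array(text)
    psi = psi_function(sa)
    diff = differential(sa)
    last = len(text) - 1
    hits = 0
    for j in range(1, len(sa)):
        in_run = psi[j] == psi[j - 1] + 1
        if in_run and last not in (sa[j], sa[j - 1]):
            assert diff[psi[j]] == diff[j]
            hits += 1
    return hits


if __name__ == "__main__":
    text = "abracadabra$"
    print(differential(suffix_array(text)))
    print(count_matching_runs(text))
```

On this input the differential array contains the repeated segment 3, 2, 3, −7, which is the kind of repetition a grammar-based compressor such as Re-Pair can replace with a shared rule.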
