Dynamic Entropy-Compressed Sequences and Full-Text Indexes

Given a sequence of n bits with binary zero-order entropy H0, we present a dynamic data structure that requires nH0 + o(n) bits of space and supports rank and select, as well as insertion and deletion of bits at arbitrary positions, in O(log n) worst-case time. This extends previous results by Hon et al. [ISAAC 2003], who achieve O(log n / log log n) time for rank and select but Θ(polylog(n)) amortized time for inserting and deleting bits, and require n + o(n) bits of space; and by Raman et al. [SODA 2002], who achieve constant query time but only for a static structure. In particular, our result is the first entropy-bounded dynamic data structure for rank and select over bit sequences. We then show how the above result can be used to build a dynamic full-text self-index for a collection of texts over an alphabet of size σ, of overall length n and zero-order entropy H0. The index requires nH0 + o(n log σ) bits of space, and can count the number of occurrences of a pattern of length m in O(m log n log σ) time. Reporting the occ occurrences can be supported in O(occ log^2 n log σ) time, paying O(n) extra space. Inserting text into the collection takes O(log n log σ) time per symbol, which becomes O(log^2 n log σ) for deletions. This improves a previous result by Chan et al. [CPM 2004]. As a consequence, we obtain an O(n log n log σ) time construction algorithm for a compressed self-index requiring nH0 + o(n log σ) bits of working space during construction.
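For readers unfamiliar with the operations named above, the following is a minimal reference sketch of their semantics, not the data structure of the paper: every operation here takes O(n) time and the bits are stored uncompressed. The names `binary_h0` and `NaiveDynamicBitVector`, the 0-based positions, and the convention that rank counts over the prefix [0, i) are assumptions made for illustration only.

```python
import math


def binary_h0(bits):
    """Zero-order empirical entropy, in bits per symbol, of a 0/1 sequence."""
    n = len(bits)
    ones = sum(bits)
    h = 0.0
    for count in (ones, n - ones):
        if count:  # a symbol that never occurs contributes nothing
            p = count / n
            h -= p * math.log2(p)
    return h


class NaiveDynamicBitVector:
    """Reference semantics only: each operation is O(n), not O(log n)."""

    def __init__(self, bits=()):
        self.bits = list(bits)

    def rank(self, bit, i):
        """Number of occurrences of `bit` among positions [0, i)."""
        return sum(1 for b in self.bits[:i] if b == bit)

    def select(self, bit, j):
        """Position of the j-th occurrence of `bit` (j >= 1), or -1 if none."""
        seen = 0
        for pos, b in enumerate(self.bits):
            if b == bit:
                seen += 1
                if seen == j:
                    return pos
        return -1

    def insert(self, i, bit):
        """Insert `bit` so that it ends up at position i."""
        self.bits.insert(i, bit)

    def delete(self, i):
        """Remove the bit at position i."""
        del self.bits[i]
```

Under these conventions, a sequence of n bits with entropy `binary_h0(bits)` has nH0 equal to `len(bits) * binary_h0(bits)`, which is the leading term of the space bound stated in the abstract.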

[1]  Roberto Grossi,et al.  Squeezing succinct data structures into entropy bounds , 2006, SODA '06.

[2]  Wing-Kai Hon,et al.  Constructing Compressed Suffix Arrays with Large Alphabets , 2003, ISAAC.

[3]  Wing-Kai Hon,et al.  Compressed indexes for dynamic text collections , 2007, TALG.

[4]  Guy E. Blelloch,et al.  Compact representations of ordered sets , 2004, SODA '04.

[5]  Gonzalo Navarro,et al.  Compressed full-text indexes , 2007, CSUR.

[6]  Juha Kärkkäinen,et al.  Fast BWT in small space by blockwise suffix sorting , 2007, Theor. Comput. Sci..

[7]  Erik D. Demaine,et al.  Tight bounds for the partial-sums problem , 2004, SODA '04.

[8]  Wing-Kai Hon,et al.  Dynamic Rank/Select Dictionaries with Applications to XML Indexing , 2006 .

[9]  Raffaele Giancarlo,et al.  The myriad virtues of Wavelet Trees , 2009, Inf. Comput..

[10]  Wing-Kai Hon,et al.  Succinct Data Structures for Searchable Partial Sums , 2003, ISAAC.

[11]  Paolo Ferragina,et al.  A Theoretical and Experimental Study on the Construction of Suffix Arrays in External Memory , 2001, Algorithmica.

[12]  Paolo Ferragina,et al.  A simple storage scheme for strings achieving entropy bounds , 2007, SODA '07.

[13]  Rasmus Pagh,et al.  Low redundancy in dictionaries with O(1) worst case lookup time , 1998 .

[14]  Ian H. Witten,et al.  Text Compression , 1990, 125 Problems in Text Algorithms.

[15]  Roberto Grossi,et al.  Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching , 2005, SIAM J. Comput..

[16]  Ricardo Baeza-Yates,et al.  Information Retrieval: Data Structures and Algorithms , 1992 .

[17]  Rodrigo González,et al.  Statistical Encoding of Succinct Data Structures , 2006, CPM.

[18]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[19]  Gonzalo Navarro,et al.  Compressed representations of sequences and full-text indexes , 2007, TALG.

[20]  Giovanni Manzini,et al.  An analysis of the Burrows-Wheeler transform , 2001, SODA '99.

[21]  Giovanni Manzini,et al.  Compression boosting in optimal linear time using the Burrows-Wheeler Transform , 2004, SODA '04.

[22]  S. Srinivasa Rao,et al.  Rank/select operations on large alphabets: a tool for text indexing , 2006, SODA '06.

[23]  Peter Elias,et al.  Universal codeword sets and representations of the integers , 1975, IEEE Trans. Inf. Theory.

[24]  Wing-Kai Hon,et al.  Compressed Index for a Dynamic Collection of Texts , 2004, CPM.

[25]  Wing-Kai Hon,et al.  Compressed data structures: Dictionaries and data-aware measures , 2007, Theor. Comput. Sci..

[26]  Giovanni Manzini,et al.  Indexing compressed text , 2005, JACM.

[27]  Gonzalo Navarro,et al.  Rank and select revisited and extended , 2007, Theor. Comput. Sci..

[28]  Roberto Grossi,et al.  When indexing equals compression: experiments with compressing suffix arrays and applications , 2004, SODA '04.

[29]  Wing-Kai Hon,et al.  Breaking a time-and-space barrier in constructing full-text indices , 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings..

[30]  Gonzalo Navarro,et al.  Succinct Suffix Arrays based on Run-Length Encoding , 2005, Nord. J. Comput..

[31]  Rajeev Raman,et al.  Succinct Dynamic Data Structures , 2001, WADS.

[32]  Paul F. Dietz.  Optimal Algorithms for List Indexing and Subset Rank , 1989, WADS.

[33]  Giovanni Manzini,et al.  Opportunistic data structures with applications , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[34]  Gonzalo Navarro,et al.  Indexing text using the Ziv-Lempel trie , 2002, J. Discrete Algorithms.

[35]  Gonzalo Navarro,et al.  Space-Efficient Construction of LZ-Index , 2005, ISAAC.

[36]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[37]  Alberto Apostolico,et al.  The Myriad Virtues of Subword Trees , 1985 .

[38]  Roberto Grossi,et al.  High-order entropy-compressed text indexes , 2003, SODA '03.

[39]  Rajeev Raman,et al.  Succinct indexable dictionaries with applications to encoding k-ary trees and multisets , 2002, SODA '02.