Space-efficient construction of Lempel-Ziv compressed text indexes

A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This matters in practice because the index of a very large text can then be held entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Indexing text using the Ziv-Lempel trie, Journal of Discrete Algorithms (JDA) 2 (1) (2004) 87-114] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text $T[1..u]$ over an alphabet of size $\sigma$, the LZ-index requires $4|LZ|(1+o(1))$ bits of space, where $|LZ|$ is the size of the LZ78-compression of $T$. This can be bounded by $|LZ| = uH_k(T) + o(u\log\sigma)$, where $H_k(T)$ is the $k$-th order empirical entropy of $T$, for any $k = o(\log_\sigma u)$. The LZ-index is built in $O(u\log\sigma)$ time, yet its construction requires $O(u\log u)$ bits of main memory in the worst case. In practice, the LZ-index occupies 1.0-1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in $O(u(\log\sigma + \log\log u))$ time and requiring $4|LZ|(1+o(1))$ bits of main memory, that is, asymptotically the same space as the final index. We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times $|LZ|(1+o(1))$ bits, and show that these can also be built using asymptotically the same space as the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. In this model we show how to build all the LZ-index variants in $O(u(\log\sigma + \log\log u))$ time, and within $|LZ|(1+o(1))$ bits of main memory, that is, asymptotically just the space needed to hold the LZ78-compressed text. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes.
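For readers unfamiliar with the LZ78 parsing that underlies $|LZ|$, the following is a minimal Python sketch of it, not the construction algorithm of this paper. It cuts the text into phrases, each being the longest previously seen phrase extended by one character, and stores them as (parent phrase, character) pairs, which implicitly form the LZ78 trie. The function names `lz78_parse` and `lz78_decode`, the dictionary-based trie, and the handling of a final incomplete phrase with an empty character are assumptions made for illustration only.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases, returned as (parent_id, char) pairs.

    Phrase 0 is the empty phrase (the trie root). Phrase i (i >= 1) equals
    phrase parent_id followed by char.
    """
    trie = {}      # (parent_id, char) -> phrase_id; hypothetical plain-dict trie
    phrases = []
    node = 0       # current trie node, starting at the root
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the current match
        else:
            phrases.append((node, ch))        # new phrase: old phrase + one char
            trie[(node, ch)] = len(phrases)
            node = 0
    if node != 0:
        # Text ended inside an existing phrase: emit it with an empty character
        # (illustrative choice; a unique terminator symbol is another option).
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original text from the (parent_id, char) pairs."""
    strings = [""]
    for parent, ch in phrases:
        strings.append(strings[parent] + ch)
    return "".join(strings[1:])


if __name__ == "__main__":
    parsed = lz78_parse("abababab")
    print(len(parsed), parsed)
    print(lz78_decode(parsed))
```

The number of phrases grows more slowly than the text length on compressible inputs, which is why space bounds expressed in terms of $|LZ|$ can be much smaller than the $O(u\log u)$ bits used by the original construction.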
