Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices

Given a text <i>T</i>[1..<i>n</i>] over an alphabet of size σ, the <i>full-text search</i> problem consists of locating the <i>occ</i> occurrences of a given pattern <i>P</i>[1..<i>m</i>] in <i>T</i>. <i>Compressed full-text self-indices</i> are space-efficient representations of the text that provide direct access to and indexed search on it. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, 4<i>nH</i><sub><i>k</i></sub>(<i>T</i>) + <i>o</i>(<i>n</i> log σ) bits of space, where <i>H</i><sub><i>k</i></sub>(<i>T</i>) is the <i>k</i>-th order empirical entropy of <i>T</i>). In practice, the average locating complexity of the LZ-index is <i>O</i>(σ <i>m</i> log<sub>σ</sub> <i>n</i> + <i>occ</i> σ<sup><i>m</i>/2</sup>). It can extract text substrings of length ℓ in <i>O</i>(ℓ) time. This index outperforms competing schemes both in locating short patterns and in extracting text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use <i>nH</i><sub><i>k</i></sub>(<i>T</i>) + <i>o</i>(<i>n</i> log σ) bits in theory), and it does not offer space/time tuning options. This limits its applicability. In this article, we study practical ways to reduce the space of the LZ-index. We obtain new LZ-index variants that require 2(1+ε)<i>nH</i><sub><i>k</i></sub>(<i>T</i>) + <i>o</i>(<i>n</i> log σ) bits of space, for any 0 < ε < 1. They have an average locating time of <i>O</i>(1/ε (<i>m</i> log <i>n</i> + <i>occ</i> σ<sup><i>m</i>/2</sup>)), while extracting takes <i>O</i>(ℓ) time. We perform extensive experimentation and conclude that our schemes are able to reduce the space of the original LZ-index to about 2/3 of its size, that is, to around 3 times the compressed text size. Our schemes are able to extract about 1 to 2 MB of the text per second, being twice as fast as the most competitive alternatives. Pattern occurrences are located at a rate of up to 1 to 4 million per second. This constitutes the best space/time trade-off when indices are allowed to use 4 times the size of the compressed text or more.
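
For readers unfamiliar with LZ78, the following minimal Python sketch illustrates the phrase parsing on which the LZ-index is based. It is only an illustration of the textbook LZ78 parse, not the index described in the article: the function names `lz78_parse` and `lz78_decode` and the dictionary-based trie are assumptions made for this example, and none of the succinct structures or space bounds above are modeled.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases.

    Each phrase is (parent_id, char): the longest previously seen phrase
    (id 0 is the empty phrase) extended by one new character.
    """
    trie = {}        # (parent_id, char) -> phrase id; a plain dict stands in for the LZ78 trie
    phrases = []     # phrase i+1 is stored at index i
    node = 0         # current trie node, starting at the root (empty phrase)
    for ch in text:
        key = (node, ch)
        if key in trie:
            node = trie[key]              # keep extending the current match
        else:
            phrases.append(key)           # new phrase = matched phrase + ch
            trie[key] = len(phrases)
            node = 0                      # restart from the root
    if node != 0:
        # The text ended inside an already known phrase; record it with
        # an empty extension character so decoding stays lossless.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original text from the (parent_id, char) phrase list."""
    strings = [""]                        # strings[0] is the empty phrase
    for parent, ch in phrases:
        strings.append(strings[parent] + ch)
    return "".join(strings[1:])


if __name__ == "__main__":
    sample = "abababab"
    parsed = lz78_parse(sample)
    print(parsed)
    print(lz78_decode(parsed) == sample)
```

Each phrase is a previous phrase plus one character, so the set of phrases forms a trie whose node count equals the number of phrases; LZ78-based indices build their search structures over that trie.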
