A Lempel-Ziv Text Index on Secondary Storage

Full-text searching consists of locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k + o(u log σ) bits of space, where H_k denotes the k-th order empirical entropy of T, for any k = o(log_σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4-2.3 times the text size including the text, which is 39%-65% of the size of traditional indexes such as String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. It is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04-1.68 times the text size, requiring about 20-60 disk accesses, depending on the pattern length.
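
As a rough illustration of the Lempel-Ziv parsing that an index of this kind builds on, the sketch below splits a text into phrases in the style of the 1978 Ziv-Lempel scheme [4]. It assumes that this is the variant intended by the abstract. The function names `lz78_parse` and `decode_phrases` are hypothetical and chosen only for this example. The code is not the index described here: it shows the phrase decomposition only, with no disk layout and no search structure.

```python
def lz78_parse(text):
    """Split text into LZ78-style phrases.

    Each phrase is stored as (parent_id, char): the id of an earlier
    phrase (0 means the empty phrase) extended by one character.
    """
    trie = {}      # maps (parent_id, char) -> phrase id
    phrases = []   # phrases[i - 1] describes phrase i
    node = 0       # current position in the trie; 0 is the root
    for ch in text:
        key = (node, ch)
        if key in trie:
            # The current match can be extended by ch; keep going.
            node = trie[key]
        else:
            # Longest known phrase plus one new character: a new phrase.
            phrases.append(key)
            trie[key] = len(phrases)
            node = 0
    if node != 0:
        # The text ended inside a known phrase; emit it with no new char.
        phrases.append((node, ""))
    return phrases


def decode_phrases(phrases):
    """Expand (parent_id, char) pairs back into the phrase strings."""
    strings = [""]  # strings[0] is the empty phrase
    for parent, ch in phrases:
        strings.append(strings[parent] + ch)
    return strings[1:]


if __name__ == "__main__":
    parsed = lz78_parse("abababab")
    print(parsed)
    print(decode_phrases(parsed))
```

Concatenating the decoded phrases reproduces the original text, so the parse is lossless. Each phrase refers to an earlier one, which is why the phrase set can be stored as a trie.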

[1]  Ricardo A. Baeza-Yates, et al. Fast and flexible word searching on compressed text, 2000, TOIS.

[2]  Gonzalo Navarro, et al. Indexing text using the Ziv-Lempel trie, 2002, J. Discrete Algorithms.

[3]  Gonzalo Navarro, et al. Space-Efficient Construction of LZ-Index, 2005, ISAAC.

[4]  Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[5]  Kunihiko Sadakane, et al. Succinct representations of lcp information and improvements in the compressed suffix arrays, 2002, SODA '02.

[6]  Roberto Grossi, et al. Fast string searching in secondary storage: theoretical developments and experimental results, 1996, SODA '96.

[7]  Ricardo A. Baeza-Yates, et al. Hierarchies of Indices for Text Searching, 1994, Inf. Syst.

[8]  Gonzalo Navarro, et al. Compressed full-text indexes, 2007, CSUR.

[9]  Gonzalo Navarro, et al. Advantages of Backward Searching - Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays, 2004, ISAAC.

[10]  D. K. Harmon, et al. Overview of the Third Text Retrieval Conference (TREC-3), 1996.

[11]  John L. Smith Tables, 1969, Neuromuscular Disorders.

[12]  Donna K. Harman, et al. Overview of the Third Text REtrieval Conference (TREC-3), 1995, TREC.

[13]  Giovanni Manzini, et al. Indexing compressed text, 2005, JACM.

[14]  Rajeev Raman, et al. Representing Trees of Higher Degree, 2005, Algorithmica.

[15]  Roberto Grossi, et al. The string B-tree: a new data structure for string search in external memory and its applications, 1999, JACM.

[16]  Wing-Kai Hon, et al. Succinct Data Structures for Searchable Partial Sums, 2003, ISAAC.

[17]  Giovanni Manzini, et al. Compression of low entropy strings with Lempel-Ziv algorithms, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[18]  Giovanni Manzini, et al. An analysis of the Burrows-Wheeler transform, 2001, SODA '99.

[19]  Paolo Ferragina, et al. Indexing compressed text, 2005, JACM.

[20]  Donald R. Morrison, et al. PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric, 1968, J. ACM.

[21]  David R. Clark, et al. Efficient suffix trees on secondary storage, 1996, SODA '96.

[22]  Rodrigo González, et al. Compressed Text Indexes with Fast Locate, 2007, CPM.

[23]  Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.

[24]  J. Ian Munro, et al. Succinct Representation of Balanced Parentheses and Static Trees, 2002, SIAM J. Comput.

[25]  Gonzalo Navarro, et al. Reducing the Space Requirement of LZ-Index, 2006, CPM.

[26]  Z. Galil, et al. Combinatorial Algorithms on Words, 1985.

[27]  Alberto Apostolico, et al. The Myriad Virtues of Subword Trees, 1985.

[28]  Stefan Kurtz, et al. Reducing the space requirement of suffix trees, 1999.