A Lempel-Ziv Text Index on Secondary Storage

Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k + o(u log σ) bits of space, where H_k denotes the k-th order empirical entropy of T, for any k = o(log_σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4-2.3 times the text size, including the text, which is 39%-65% of the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04-1.68 times the text size, requiring about 20-60 disk accesses, depending on the pattern length.
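The abstract does not say which Lempel-Ziv variant the index builds on. As a minimal sketch, assuming it is the 1978 scheme of Ziv and Lempel listed in the references, the text can be split into phrases where each phrase is an earlier phrase extended by one character. The function names `lz78_parse` and `lz78_decode` are hypothetical, and the paper's on-disk index structures are not shown here.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases.

    Returns a list of (parent_id, char) pairs. Phrase i (1-based) equals
    phrase parent_id followed by char; id 0 is the empty phrase.
    """
    trie = {}      # (parent_id, char) -> phrase id
    phrases = []
    node = 0       # id of the phrase matched so far
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the match
        else:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append((node, ch))        # new phrase = match + ch
            node = 0
    if node != 0:
        # The text ended inside an already known phrase: emit it
        # with an empty extension character.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the text from the (parent_id, char) pairs."""
    built = [""]                              # built[0] is the empty phrase
    for parent, ch in phrases:
        built.append(built[parent] + ch)
    return "".join(built[1:])


if __name__ == "__main__":
    parsed = lz78_parse("abracadabra")
    print(parsed)
    print(lz78_decode(parsed))
```

On a repetitive text the phrases grow longer as the parse proceeds, so the number of phrases is smaller than the number of characters.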

Gonzalo Navarro | Diego Arroyuelo

[1] Ricardo A. Baeza-Yates, et al. Fast and flexible word searching on compressed text, 2000, TOIS.

[2] Gonzalo Navarro, et al. Indexing text using the Ziv-Lempel trie, 2002, J. Discrete Algorithms.

[3] Gonzalo Navarro, et al. Space-Efficient Construction of LZ-Index, 2005, ISAAC.

[4] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[5] Kunihiko Sadakane, et al. Succinct representations of lcp information and improvements in the compressed suffix arrays, 2002, SODA '02.

[6] Roberto Grossi, et al. Fast string searching in secondary storage: theoretical developments and experimental results, 1996, SODA '96.

[7] Ricardo A. Baeza-Yates, et al. Hierarchies of Indices for Text Searching, 1994, Inf. Syst.

[8] Gonzalo Navarro, et al. Compressed full-text indexes, 2007, CSUR.

[9] Gonzalo Navarro, et al. Advantages of Backward Searching - Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays, 2004, ISAAC.

[10] D. K. Harman, et al. Overview of the Third Text Retrieval Conference (TREC-3), 1996.

[11] John L. Smith. Tables, 1969, Neuromuscular Disorders.

[12] Donna K. Harman, et al. Overview of the Third Text REtrieval Conference (TREC-3), 1995, TREC.

[13] Giovanni Manzini, et al. Indexing compressed text, 2005, JACM.

[14] Rajeev Raman, et al. Representing Trees of Higher Degree, 2005, Algorithmica.

[15] Roberto Grossi, et al. The string B-tree: a new data structure for string search in external memory and its applications, 1999, JACM.

[16] Wing-Kai Hon, et al. Succinct Data Structures for Searchable Partial Sums, 2003, ISAAC.

[17] Giovanni Manzini, et al. Compression of low entropy strings with Lempel-Ziv algorithms, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[18] Giovanni Manzini, et al. An analysis of the Burrows-Wheeler transform, 2001, SODA '99.

[19] Paolo Ferragina, et al. Indexing compressed text, 2005, JACM.

[20] Donald R. Morrison, et al. PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric, 1968, J. ACM.

[21] David R. Clark, et al. Efficient suffix trees on secondary storage, 1996, SODA '96.

[22] Rodrigo González, et al. Compressed Text Indexes with Fast Locate, 2007, CPM.

[23] Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.

[24] J. Ian Munro, et al. Succinct Representation of Balanced Parentheses and Static Trees, 2002, SIAM J. Comput.

[25] Gonzalo Navarro, et al. Reducing the Space Requirement of LZ-Index, 2006, CPM.

[26] Z. Galil, et al. Combinatorial Algorithms on Words, 1985.

[27] Alberto Apostolico, et al. The Myriad Virtues of Subword Trees, 1985.

[28] Stefan Kurtz, et al. Reducing the space requirement of suffix trees, 1999.