Paper Information - Space-Efficient Algorithms for Document Retrieval

Space-Efficient Algorithms for Document Retrieval

We study the Document Listing problem, where a collection D of documents d_1, ..., d_k of total length Σ_i |d_i| = n is to be preprocessed so that one can later efficiently list all the ndoc documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem: with O(n) time preprocessing, one can answer the queries in O(m + ndoc) time. In this paper, we improve the space requirement of Muthukrishnan's solution from O(n log n) bits to |CSA| + 2n + n log k (1 + o(1)) bits, where |CSA| ≤ n log |Σ| (1 + o(1)) is the size of any suitable compressed suffix array (CSA), and Σ is the underlying alphabet of the documents. The time requirement depends on the CSA used, but we can obtain, for example, the optimal O(m + ndoc) time when |Σ|, k = O(polylog(n)). For general |Σ| and k, the time requirement becomes O(m log |Σ| + ndoc log k). Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan's solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.
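To make the problem statement concrete, the following is a minimal Python sketch of document listing in the style usually attributed to Muthukrishnan's solution: a suffix array over the concatenated documents, a document array, an array of previous occurrences of the same document, and range-minimum queries over that array. It is an illustration only and not the space-efficient structure of this paper. The function names (`build_index`, `sa_range`, `list_documents`), the `"\x00"` separator, the sorting-based suffix array construction, and the linear-scan range minimum are all assumptions chosen for brevity, so the sketch does not achieve the O(n) preprocessing, the O(m + ndoc) query time, or the stated space bounds.

```python
def build_index(docs):
    """Build a naive (non-succinct) index for document listing."""
    sep = "\x00"  # assumed separator; must not occur inside any document
    text = "".join(d + sep for d in docs)
    # doc_of[p] = index of the document that covers text position p
    doc_of = []
    for i, d in enumerate(docs):
        doc_of.extend([i] * (len(d) + 1))
    # Naive suffix array by sorting suffixes (illustration only, not O(n))
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    # Document array: document of the i-th smallest suffix
    da = [doc_of[p] for p in sa]
    # prev[i] = largest j < i with da[j] == da[i], or -1 if there is none
    last, prev = {}, []
    for i, d in enumerate(da):
        prev.append(last.get(d, -1))
        last[d] = i
    return text, sa, da, prev


def sa_range(text, sa, pattern):
    """Return the inclusive suffix array range [sp, ep] of suffixes
    that start with pattern; sp > ep means there is no occurrence."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo - 1


def list_documents(index, pattern):
    """List each document containing pattern exactly once."""
    text, sa, da, prev = index
    sp, ep = sa_range(text, sa, pattern)
    found = []
    stack = [(sp, ep)]
    while stack:
        l, r = stack.pop()
        if l > r:
            continue
        # Range minimum over prev[l..r]; a linear scan stands in for a
        # constant-time RMQ structure.
        i = min(range(l, r + 1), key=prev.__getitem__)
        if prev[i] >= sp:
            continue  # every document in this subrange was already reported
        found.append(da[i])
        stack.append((l, i - 1))
        stack.append((i + 1, r))
    return sorted(found)


docs = ["banana", "bandana", "cabana"]
index = build_index(docs)
print(list_documents(index, "bana"))  # prints [0, 2]
```

The key step is the test `prev[i] >= sp`: a position in the range is the first occurrence of its document within the range exactly when its previous occurrence lies before `sp`, so each matching document is reported once regardless of how many times the pattern occurs in it.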

Veli Mäkinen | Niko Välimäki

[1] Wojciech Rytter, et al. Jewels of stringology: text algorithms, 2002.

[2] Kunihiko Sadakane, et al. Succinct data structures for flexible text retrieval systems, 2007, J. Discrete Algorithms.

[3] Gonzalo Navarro, et al. Compressed full-text indexes, 2007, CSUR.

[4] Michael A. Bender, et al. The LCA Problem Revisited, 2000, LATIN.

[5] Ian H. Witten, et al. Managing gigabytes (2nd ed.): compressing and indexing documents and images, 1999.

[6] Ingmar Weber, et al. Output-sensitive autocompletion search, 2006, Information Retrieval.

[7] G. Italiano, et al. Algorithms - ESA '98: 6th Annual European Symposium, Venice, Italy, August 24-26, 1998: proceedings, 1998.

[8] E. Myers, et al. Basic local alignment search tool, 1990, Journal of molecular biology.

[9] Ian H. Witten, et al. Managing gigabytes, 1994.

[10] Yossi Matias, et al. Augmenting Suffix Trees, with Applications, 1998, ESA.

[11] Gonzalo Navarro, et al. Compressed representations of sequences and full-text indexes, 2007, TALG.

[12] William F. Smyth, et al. Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory, 2006, SPIRE.

[13] Gonzalo Navarro, et al. Dynamic entropy-compressed sequences and full-text indexes, 2006, TALG.

[14] S. Muthukrishnan, et al. Efficient algorithms for document retrieval problems, 2002, SODA '02.

[15] Volker Heun, et al. A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array, 2007, ESCAPE.

[16] Kunihiko Sadakane, et al. Space-Efficient Data Structures for Flexible Text Retrieval Systems, 2002, ISAAC.

[17] Wojciech Rytter, et al. Jewels of stringology, 2002.