Space-Efficient Algorithms for Document Retrieval

We study the Document Listing problem, where a collection D of documents d_1, ..., d_k of total length Σ_i |d_i| = n is to be preprocessed so that one can later efficiently list all the ndoc documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem: with O(n) time preprocessing, one can answer the queries in O(m + ndoc) time. In this paper, we improve the space requirement of Muthukrishnan's solution from O(n log n) bits to |CSA| + 2n + n log k (1 + o(1)) bits, where |CSA| ≤ n log |Σ| (1 + o(1)) is the size of any suitable compressed suffix array (CSA), and Σ is the underlying alphabet of the documents. The time requirement depends on the CSA used, but we can obtain, for example, the optimal O(m + ndoc) time when |Σ|, k = O(polylog(n)). For general |Σ| and k, the time requirement becomes O(m log |Σ| + ndoc log k). Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan's solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.
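To make the problem concrete, the following is a minimal Python sketch of document listing in the style of Muthukrishnan's approach as it is usually described: a suffix array over the concatenated documents, a document array, an array of previous occurrences, and recursive range-minimum queries. It is an illustration under stated assumptions, not the space-efficient structure of this paper: it uses plain uncompressed arrays, a naive suffix sort, and a linear scan in place of a constant-time range-minimum structure, so it does not achieve the O(m + ndoc) bound. The names `build_index`, `pattern_range`, and `list_documents`, and the choice of `"\x00"` as a separator assumed not to occur in any document, are hypothetical.

```python
def build_index(docs, sep="\x00"):
    """Build plain (uncompressed) arrays for document listing.

    Assumption: `sep` does not occur in any document.
    """
    text = sep.join(docs) + sep
    # doc_of_pos[i] = index of the document covering text position i
    # (each separator is attributed to the document it terminates).
    doc_of_pos = []
    for d, doc in enumerate(docs):
        doc_of_pos.extend([d] * (len(doc) + 1))
    # Naive suffix sorting; a real index would use a linear-time construction.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Document array: which document each suffix (in sorted order) starts in.
    doc_arr = [doc_of_pos[i] for i in sa]
    # prev[i] = largest j < i with doc_arr[j] == doc_arr[i], or -1 if none.
    last, prev = {}, []
    for i, d in enumerate(doc_arr):
        prev.append(last.get(d, -1))
        last[d] = i
    return text, sa, doc_arr, prev


def pattern_range(text, sa, pattern):
    """Return the half-open suffix array range [lo, hi) of suffixes prefixed by pattern."""
    m = len(pattern)

    def first(pred):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    return first(lambda s: s >= pattern), first(lambda s: s > pattern)


def list_documents(index, pattern):
    """List each document containing `pattern` exactly once, in sorted order."""
    text, sa, doc_arr, prev = index
    lo, hi = pattern_range(text, sa, pattern)
    found = []
    stack = [(lo, hi - 1)]
    while stack:
        l, r = stack.pop()
        if l > r:
            continue
        # Linear scan standing in for a constant-time range-minimum query.
        p = min(range(l, r + 1), key=prev.__getitem__)
        if prev[p] >= lo:
            # Every document in [l, r] already occurs earlier inside [lo, hi).
            continue
        found.append(doc_arr[p])
        stack.append((l, p - 1))
        stack.append((p + 1, r))
    return sorted(found)


index = build_index(["banana", "ananas", "cherry"])
print(list_documents(index, "ana"))
print(list_documents(index, "rr"))
```

A position p in the pattern's range is the first occurrence of its document within that range exactly when prev[p] < lo, so each matching document is reported once regardless of how many times the pattern occurs in it. This is what makes the query time depend on ndoc rather than on the total number of occurrences.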

[1]  Wojciech Rytter,et al.  Jewels of stringology : text algorithms , 2002 .

[2]  Kunihiko Sadakane,et al.  Succinct data structures for flexible text retrieval systems , 2007, J. Discrete Algorithms.

[3]  Gonzalo Navarro,et al.  Compressed full-text indexes , 2007, CSUR.

[4]  Michael A. Bender,et al.  The LCA Problem Revisited , 2000, LATIN.

[5]  Ian H. Witten,et al.  Managing gigabytes (2nd ed.): compressing and indexing documents and images , 1999 .

[6]  Ingmar Weber,et al.  Output-sensitive autocompletion search , 2006, Information Retrieval.

[7]  G. Italiano,et al.  Algorithms - ESA '98 : 6th Annual European Symposium, Venice, Italy, August 24-26, 1998 : proceedings , 1998 .

[8]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[9]  Ian H. Witten,et al.  Managing gigabytes , 1994 .

[10]  Yossi Matias,et al.  Augmenting Suffix Trees, with Applications , 1998, ESA.

[11]  Gonzalo Navarro,et al.  Compressed representations of sequences and full-text indexes , 2007, TALG.

[12]  William F. Smyth,et al.  Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory , 2006, SPIRE.

[13]  Gonzalo Navarro,et al.  Dynamic entropy-compressed sequences and full-text indexes , 2006, TALG.

[14]  S. Muthukrishnan,et al.  Efficient algorithms for document retrieval problems , 2002, SODA '02.

[15]  Volker Heun,et al.  A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array , 2007, ESCAPE.

[16]  Kunihiko Sadakane,et al.  Space-Efficient Data Structures for Flexible Text Retrieval Systems , 2002, ISAAC.

[17]  Wojciech Rytter,et al.  Jewels of stringology , 2002 .