Improved compressed indexes for full-text document retrieval

We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index and |CSA| is its size in bits. Using monotone minimal perfect hash functions, we give new algorithms for document listing with frequencies and top-k document retrieval using just |CSA| + O(n lg lg lg D) bits. We also improve current solutions that use 2|CSA| + o(n) bits, and consider other problems such as colored range listing, top-k most important documents, and computing arbitrary frequencies.
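To make the two main query types concrete, the following is a minimal, uncompressed baseline sketch of document listing with frequencies and top-k retrieval by frequency. It is not the compressed-index method summarized above: it scans every document and uses no CSA or hash functions. The function names, the use of 0-based document identifiers, and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def document_listing_with_frequencies(docs, pattern):
    """Map each document id to the number of occurrences of `pattern` in it.

    Naive scan (illustrative only). Overlapping occurrences are counted,
    and documents with zero occurrences are omitted from the result.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    freqs = Counter()
    for doc_id, text in enumerate(docs):
        start = text.find(pattern)
        while start != -1:
            freqs[doc_id] += 1
            # Advance by one position so overlapping matches are counted.
            start = text.find(pattern, start + 1)
    return dict(freqs)


def top_k_documents(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs, highest frequency first.

    Ties are broken by smaller document id (an assumption for this sketch).
    """
    freqs = document_listing_with_frequencies(docs, pattern)
    ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

For example, on the collection `["abracadabra", "banana", "cadabra"]` with pattern `"abra"`, document listing reports documents 0 and 2 with frequencies 2 and 1, and the top-1 query returns document 0. The indexes discussed above aim to answer the same queries without scanning the documents and within the stated space bounds.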
