Towards an Optimal Space-and-Query-Time Index for Top-k Document Retrieval

Let $\cal{D} = \{d_1, d_2, \ldots, d_D\}$ be a given set of $D$ string documents of total length $n$. Our task is to index $\cal{D}$ such that the $k$ most relevant documents for an online query pattern $P$ of length $p$ can be retrieved efficiently. We propose an index of size $|CSA| + n\log D\,(2+o(1))$ bits with $O(t_s(p) + k\log\log n + \mathrm{poly}\log\log n)$ query time for the basic relevance metric term frequency, where $|CSA|$ is the size (in bits) of a compressed full-text index of $\cal{D}$ that searches for a pattern of length $p$ in $O(t_s(p))$ time. We further reduce the space to $|CSA| + n\log D\,(1+o(1))$ bits, at the cost of increasing the query time to $O(t_s(p) + k(\log\sigma\log\log n)^{1+\epsilon} + \mathrm{poly}\log\log n)$, where $\sigma$ is the alphabet size and $\epsilon > 0$ is any constant.
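
To make the problem statement concrete, the following is a minimal brute-force sketch of top-k retrieval under term frequency. It is not the proposed index: it scans every document at query time and uses no compressed structures, so it serves only as a reference for what a query should return. The function names, the choice to count overlapping occurrences, and the tie-breaking rule (smaller document identifier first) are assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps (an assumption)."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_term_frequency(documents, pattern, k):
    """Return up to k (doc_id, tf) pairs with the highest term frequency.

    Documents in which the pattern does not occur are never reported.
    Ties are broken by smaller doc_id (an illustrative convention).
    """
    scored = []
    for doc_id, text in enumerate(documents):
        tf = count_occurrences(text, pattern)
        if tf > 0:
            scored.append((tf, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda item: (-item[0], item[1]))
    return [(doc_id, tf) for tf, doc_id in best]


if __name__ == "__main__":
    docs = ["banana", "ananas", "bandana", "cherry"]
    print(top_k_by_term_frequency(docs, "ana", 2))
```

This baseline takes time proportional to the total length of the collection per query, which is the cost an index with the stated query bounds is meant to avoid.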
