Space-Efficient Top-k Document Retrieval

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage remains a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost the entire space/time tradeoff.
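To make the query semantics concrete, the following is a minimal brute-force sketch in Python of what a top-k retrieval query returns. It is not one of the structures studied in the paper: it scans every document at query time and uses no index at all. The function names `count_occurrences` and `top_k_documents` are hypothetical, and counting overlapping occurrences and breaking ties by smaller document id are assumptions made for illustration.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        # Advance by one character so overlapping matches are counted too.
        pos = text.find(pattern, pos + 1)
    return count


def top_k_documents(documents, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Brute-force baseline: scans every document on every query.
    Ties are broken by smaller doc_id (an assumption of this sketch).
    """
    matching = []
    for doc_id, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            matching.append((freq, doc_id))
    # Select the k entries with the highest frequency.
    best = heapq.nsmallest(k, matching, key=lambda fd: (-fd[0], fd[1]))
    return [(doc_id, freq) for freq, doc_id in best]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "aaaa", "xyz"]
    print(top_k_documents(docs, "a", 2))
```

Such a scan takes time proportional to the total collection size on every query, which is the cost that indexed solutions are designed to avoid.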
