Document Listing on Repetitive Collections

Many document collections consist largely of repeated material, and several indexes have been designed to take advantage of this. There has been only preliminary work, however, on document retrieval for repetitive collections. In this paper we show how one of those indexes, the run-length compressed suffix array (RLCSA), can be extended to support document listing. In our experiments, our additional structures on top of the RLCSA can reduce the query time for document listing by an order of magnitude while still using total space that is only a fraction of the raw collection size. As a byproduct, we develop a new document listing technique for general collections that is of independent interest.

[1]  S. Muthukrishnan,et al.  Efficient algorithms for document retrieval problems , 2002, SODA '02.

[2]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[3]  Gonzalo Navarro,et al.  Colored range queries and document retrieval , 2013, Theor. Comput. Sci..

[4]  Camil Demetrescu,et al.  Algorithms – ESA 2011 , 2011, Lecture Notes in Computer Science.

[5]  Wing-Kai Hon,et al.  Space-Efficient Framework for Top-k String Retrieval Problems , 2009, 2009 50th Annual IEEE Symposium on Foundations of Computer Science.

[6]  Dan E. Willard,et al.  Log-logarithmic worst-case range queries are possible in space ⊕(N) , 1983 .

[7]  Kunihiko Sadakane,et al.  Succinct data structures for flexible text retrieval systems , 2007, J. Discrete Algorithms.

[8]  Juha Kärkkäinen,et al.  A Faster Grammar-Based Self-index , 2011, LATA.

[9]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[10]  Wojciech Rytter,et al.  Extracting Powers and Periods in a String from Its Runs Structure , 2010, SPIRE.

[11]  Alejandro López-Ortiz LATIN 2010: Theoretical Informatics, 9th Latin American Symposium, Oaxaca, Mexico, April 19-23, 2010. Proceedings , 2010, Lecture Notes in Computer Science.

[12]  Roberto Grossi,et al.  High-order entropy-compressed text indexes , 2003, SODA '03.

[13]  Gonzalo Navarro,et al.  New algorithms on wavelet trees and applications to information retrieval , 2010, Theor. Comput. Sci..

[14]  Kunihiko Sadakane,et al.  Practical Entropy-Compressed Rank/Select Dictionary , 2006, ALENEX.

[15]  Wojciech Rytter,et al.  On the Maximal Number of Cubic Runs in a String , 2010, LATA.

[16]  Wojciech Szpankowski,et al.  A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors , 1993, SIAM J. Comput..

[17]  Johannes Fischer,et al.  Optimal Succinctness for Range Minimum Queries , 2008, LATIN.

[18]  Veli Mäkinen,et al.  Space-Efficient Algorithms for Document Retrieval , 2007, CPM.

[19]  Gonzalo Navarro,et al.  Improved Grammar-Based Compressed Indexes , 2012, SPIRE.

[20]  Gonzalo Navarro,et al.  Storage and Retrieval of Highly Repetitive Sequence Collections , 2010, J. Comput. Biol..

[21]  Gonzalo Navarro,et al.  Alphabet-Independent Compressed Text Indexing , 2011, ESA.

[22]  Gonzalo Navarro,et al.  Practical Compressed Document Retrieval , 2011, SEA.