Distribution-Aware Compressed Full-Text Indexes

In this paper we address the problem of building a compressed self-index that, given a distribution over the pattern queries and a bound on the space occupancy, minimizes the expected query time within that space bound. We solve this problem by reducing it to finding a minimum-weight K-link path in a particular directed acyclic graph. Notably, our solution is independent of the underlying compressed index. Our experiments compare this optimal strategy with several standard approaches and show its effectiveness in practice.
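To make the target of the reduction concrete, the following is a minimal sketch of the minimum-weight K-link path problem itself, solved with a plain dynamic program. It is not the paper's algorithm: the abstract does not specify the graph or its edge weights, so the sketch assumes a complete DAG on nodes 0..n-1 (an edge i -> j for every i < j), and the function name `min_weight_k_link_path` and the `weight` callback are hypothetical names introduced only for illustration.

```python
from math import inf
from typing import Callable, List, Tuple


def min_weight_k_link_path(
    n: int, k: int, weight: Callable[[int, int], float]
) -> Tuple[float, List[int]]:
    """Cheapest path from node 0 to node n-1 that uses exactly k edges.

    Assumption (not from the paper): the graph is the complete DAG on
    nodes 0..n-1, with an edge i -> j of cost weight(i, j) for all i < j.
    """
    # best[l][j] = minimum cost of reaching node j from node 0 with exactly l edges
    best = [[inf] * n for _ in range(k + 1)]
    # parent[l][j] = predecessor of j on that optimal l-edge path
    parent = [[-1] * n for _ in range(k + 1)]
    best[0][0] = 0.0

    for links in range(1, k + 1):
        for j in range(1, n):
            for i in range(j):
                if best[links - 1][i] == inf:
                    continue  # node i is unreachable with links-1 edges
                cand = best[links - 1][i] + weight(i, j)
                if cand < best[links][j]:
                    best[links][j] = cand
                    parent[links][j] = i

    if best[k][n - 1] == inf:
        raise ValueError("no path from 0 to n-1 with exactly k links")

    # Walk the parent pointers back from the last node to recover the path.
    path = [n - 1]
    node = n - 1
    for links in range(k, 0, -1):
        node = parent[links][node]
        path.append(node)
    path.reverse()
    return best[k][n - 1], path


if __name__ == "__main__":
    # Toy weight, purely illustrative: the cost of an edge is its squared length.
    cost, path = min_weight_k_link_path(5, 2, lambda i, j: (j - i) ** 2)
    print(cost, path)
```

This baseline takes O(K n^2) time. Faster algorithms are known when the edge weights satisfy the concave Monge property; whether and how the paper's weights meet that condition is not stated in the abstract.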
