Lempel-Ziv compressed structures for document retrieval

Abstract: Document retrieval structures index a collection of string documents, to retrieve those that are relevant to query strings p: document listing retrieves all documents where p appears; top-k retrieval retrieves the k most relevant of those. Classical structures use too much space in practice. Most current research uses compressed suffix arrays, but fast indices still use 17–21 bpc (bits per character), whereas small ones take milliseconds per returned answer. We present the first document retrieval structures based on Lempel–Ziv compression, precisely LZ78. Our structures use 7–10 bpc and dominate a large part of the space/time tradeoffs. They also enable more efficient partial or approximate answers: our document listing outputs the first 75%–80% of the answers at a rate of one per microsecond; for top-k retrieval we return a result of 90% quality at the same rate and using just 4–6 bpc. This outperforms current indices by a wide margin.
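
To make the two query types concrete, the following is a minimal brute-force sketch of their semantics only. It is not the LZ78-based structure the abstract describes and uses no compression or index. The function names are hypothetical, and relevance is assumed here to be term frequency (the number of occurrences of p in a document); the abstract does not fix the relevance measure.

```python
import heapq


def count_occurrences(doc: str, p: str) -> int:
    """Count occurrences of p in doc, allowing overlaps."""
    count = 0
    start = doc.find(p)
    while start != -1:
        count += 1
        start = doc.find(p, start + 1)
    return count


def document_listing(docs: list[str], p: str) -> list[int]:
    """Return the ids of all documents in which p appears."""
    return [i for i, doc in enumerate(docs) if p in doc]


def top_k(docs: list[str], p: str, k: int) -> list[int]:
    """Return the ids of the k documents with the most occurrences of p.

    Assumption: relevance is term frequency. Ties are broken in favor
    of the smaller document id. Documents without p are never returned.
    """
    scored = [(count_occurrences(doc, p), i) for i, doc in enumerate(docs)]
    scored = [(c, i) for c, i in scored if c > 0]
    best = heapq.nlargest(k, scored, key=lambda ci: (ci[0], -ci[1]))
    return [i for _, i in best]


docs = ["abracadabra", "banana", "cabana", "xyz"]
print(document_listing(docs, "ana"))  # [1, 2]
print(top_k(docs, "ana", 1))          # [1]
```

Both functions scan every document on each query, which is exactly the cost that an index is meant to avoid; the sketch only pins down what a correct answer looks like.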
