Universal indexes for highly repetitive document collections

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to exploit their regularities to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal: they need neither knowledge of the collection's versioning structure nor the existence of a clear versioning structure at all. We also include compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. They are, however, orders of magnitude slower.
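To make the idea of run-length compressing differential inverted lists concrete, here is a minimal Python sketch. It is an illustration only, not the encoding used in the paper: the function names (`gaps`, `run_length`, `decode`) and the sample posting list are hypothetical, and the sketch assumes that versions of the same document receive consecutive identifiers, so that a word surviving across versions yields long runs of gaps equal to 1.

```python
from itertools import groupby


def gaps(postings):
    """Differential (gap) encoding of a strictly increasing list of document ids."""
    prev = 0
    out = []
    for doc_id in postings:
        out.append(doc_id - prev)
        prev = doc_id
    return out


def run_length(gap_list):
    """Collapse consecutive equal gaps into (gap, run_length) pairs."""
    return [(gap, sum(1 for _ in group)) for gap, group in groupby(gap_list)]


def decode(runs):
    """Rebuild the original posting list from (gap, run_length) pairs."""
    postings = []
    current = 0
    for gap, length in runs:
        for _ in range(length):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: the word occurs in versions 3..8 of one document
# and in versions 20..23 of another.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
differential = gaps(postings)      # [3, 1, 1, 1, 1, 1, 12, 1, 1, 1]
runs = run_length(differential)    # [(3, 1), (1, 5), (12, 1), (1, 3)]
assert decode(runs) == postings
```

In this toy case ten gaps collapse into four pairs. A real index would additionally encode the pairs with variable-length codes and support skipping, which the sketch leaves out.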
