Fast Dictionary-Based Compression for Inverted Indexes

Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists. We show that the high degree of regularity exhibited by these integer sequences is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments that use document-level inverted index data for two large text collections, with a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
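To make the general idea concrete, the following is a minimal sketch of dictionary coding applied to one inverted list, and is not the method evaluated in this work. It assumes the list is first converted to gaps between consecutive document identifiers, and that a dictionary of frequent gap sequences is already available. The function names, the greedy longest-match parsing, and the tiny hand-built dictionary are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Phrase = Tuple[int, ...]


def to_gaps(doc_ids: Sequence[int]) -> List[int]:
    """Convert a strictly increasing list of document ids into gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps: Sequence[int]) -> List[int]:
    """Invert to_gaps by taking a running sum."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def encode(gaps: Sequence[int], dictionary: Sequence[Phrase]) -> List[int]:
    """Greedy longest-match parse of the gaps into dictionary codes.

    Hypothetical helper: a code is simply a phrase's position in the
    dictionary.
    """
    lookup = {phrase: code for code, phrase in enumerate(dictionary)}
    max_len = max(len(phrase) for phrase in dictionary)
    codes, i = [], 0
    while i < len(gaps):
        # Try the longest candidate phrase first, then shorter ones.
        for length in range(min(max_len, len(gaps) - i), 0, -1):
            code = lookup.get(tuple(gaps[i:i + length]))
            if code is not None:
                codes.append(code)
                i += length
                break
        else:
            raise ValueError(f"gap {gaps[i]} is not covered by the dictionary")
    return codes


def decode(codes: Sequence[int], dictionary: Sequence[Phrase]) -> List[int]:
    """Decoding is one table lookup and one copy per code."""
    gaps: List[int] = []
    for code in codes:
        gaps.extend(dictionary[code])
    return gaps


# Illustrative data only: a short list whose gaps repeat a regular pattern.
doc_ids = [3, 4, 5, 6, 9, 10, 11, 12, 20]
dictionary = [(1,), (3,), (8,), (1, 1, 1), (3, 1, 1, 1)]
codes = encode(to_gaps(doc_ids), dictionary)
restored = from_gaps(decode(codes, dictionary))
```

In this sketch, nine gaps are represented by three codes because the repeated pattern of gaps matches a single multi-gap phrase. Decoding needs no bit-level arithmetic, only table lookups, which is the property that makes dictionary methods fast.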
