Paper information - Fast Dictionary-Based Compression for Inverted Indexes

Fast Dictionary-Based Compression for Inverted Indexes

Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, with a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
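To make the general idea concrete, the following is a minimal sketch of dictionary coding applied to an inverted list. It is not the scheme proposed in the paper: the function names, the fixed block length of 4 gaps, the dictionary size of 256, and the tagged-tuple output format are all assumptions chosen for illustration. The sketch converts a postings list to d-gaps, collects the most frequent fixed-length gap patterns into a dictionary, and replaces each matching pattern with a single dictionary index.

```python
from collections import Counter


def to_gaps(postings):
    """Convert a strictly increasing docID list into d-gaps."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def build_dictionary(gap_lists, block=4, size=256):
    """Keep the `size` most frequent aligned blocks of `block` gaps.

    `block` and `size` are illustrative assumptions, not values from the paper.
    """
    counts = Counter()
    for gaps in gap_lists:
        for i in range(0, len(gaps) - block + 1, block):
            counts[tuple(gaps[i:i + block])] += 1
    return [pattern for pattern, _ in counts.most_common(size)]


def encode(gaps, dictionary, block=4):
    """Greedily replace dictionary patterns with codes; emit other gaps as literals."""
    index = {pattern: code for code, pattern in enumerate(dictionary)}
    tokens, i = [], 0
    while i < len(gaps):
        chunk = tuple(gaps[i:i + block])
        if len(chunk) == block and chunk in index:
            tokens.append(("code", index[chunk]))
            i += block
        else:
            tokens.append(("literal", gaps[i]))
            i += 1
    return tokens


def decode(tokens, dictionary):
    """Expand codes back into gaps, then prefix-sum the gaps into docIDs."""
    gaps = []
    for kind, value in tokens:
        if kind == "code":
            gaps.extend(dictionary[value])
        else:
            gaps.append(value)
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


postings = [1, 2, 3, 4, 5, 6, 7, 8, 20]
gaps = to_gaps(postings)
dictionary = build_dictionary([gaps])
tokens = encode(gaps, dictionary)
restored = decode(tokens, dictionary)
```

In this toy input, runs of consecutive docIDs produce repeated gap patterns, so nine postings collapse to three tokens. Decoding a code is a table lookup followed by a copy, which is the property that makes dictionary methods fast to decode.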
