ANS-Based Index Compression

Techniques for effectively representing the postings lists associated with inverted indexes have been studied for many years. Here we combine the recently developed "asymmetric numeral systems" (ANS) approach to entropy coding and a range of previous index compression methods, including VByte, Simple, and Packed. The ANS mechanism allows each of them to provide markedly improved compression effectiveness, at the cost of slower decoding rates. Using the 426 GB GOV2 collection, we show that the combination of blocking and ANS-based entropy-coding against a set of 16 magnitude-based probability models yields compression effectiveness superior to most previous mechanisms, while still providing reasonable decoding speed.
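The idea of entropy-coding integer magnitudes with ANS can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a single probability model rather than the 16 described above, uses Python's arbitrary-precision integers in place of a renormalizing fixed-width ANS state, and stores the low-order bits of each gap uncoded. All function names here (`build_model`, `rans_encode`, `rans_decode`, `encode_postings`, `decode_postings`) are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def build_model(symbols):
    """Build frequency and cumulative-frequency tables from symbol counts."""
    freq = Counter(symbols)
    cum, running = {}, 0
    for s in sorted(freq):
        cum[s] = running
        running += freq[s]
    return freq, cum, running  # running is the total frequency M


def rans_encode(symbols, freq, cum, total):
    """Range-variant ANS without renormalization: the state is one big integer."""
    x = 1
    # ANS decodes in the reverse of encoding order, so encode back to front.
    for s in reversed(symbols):
        x = (x // freq[s]) * total + cum[s] + (x % freq[s])
    return x


def rans_decode(x, count, freq, cum, total):
    """Invert rans_encode, recovering `count` symbols in their original order."""
    out = []
    for _ in range(count):
        slot = x % total
        # Linear search for the symbol whose cumulative range contains the slot.
        s = next(t for t in cum if cum[t] <= slot < cum[t] + freq[t])
        x = freq[s] * (x // total) + slot - cum[s]
        out.append(s)
    return out


def encode_postings(postings):
    """Assumes strictly increasing document ids, all >= 1, so every gap is >= 1."""
    gaps = [b - a for a, b in zip([0] + postings[:-1], postings)]
    magnitudes = [g.bit_length() for g in gaps]  # ANS-coded part
    residuals = [g - (1 << (m - 1)) for g, m in zip(gaps, magnitudes)]  # raw bits
    model = build_model(magnitudes)
    state = rans_encode(magnitudes, *model)
    return state, residuals, model


def decode_postings(state, residuals, model):
    magnitudes = rans_decode(state, len(residuals), *model)
    gaps = [(1 << (m - 1)) + r for m, r in zip(magnitudes, residuals)]
    return list(accumulate(gaps))  # prefix sums turn gaps back into document ids


if __name__ == "__main__":
    postings = [3, 7, 8, 15, 16, 40]
    state, residuals, model = encode_postings(postings)
    assert decode_postings(state, residuals, model) == postings
```

Splitting each gap into a magnitude and a residual keeps the ANS alphabet small (at most one symbol per bit length), which is the sense in which this sketch treats the model as magnitude-based. A practical coder would bound the state to a machine word and emit bytes as it grows.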
