Index Compression Using Byte-Aligned ANS Coding and Two-Dimensional Contexts

We examine approaches to block-based inverted index compression, such as the OptPFOR mechanism, in which fixed-length blocks of postings data are compressed independently of each other. Building on previous work in which asymmetric numeral systems (ANS) entropy coding is used to represent each block, we explore three enhancements: (i) the use of two-dimensional conditioning contexts, in which two aggregate parameters per block, rather than just one, categorize the distribution of symbol values that underlies the ANS approach; (ii) the use of a byte-friendly strategic mapping from symbols to ANS codeword buckets; and (iii) the use of a context merging process to combine similar probability distributions. Collectively, these improvements yield better compression of index data than the reference point set by the Interp mechanism, and hence represent a significant step forward. We describe experiments using the 426 GiB gov2 collection and a new large collection of publicly available news articles to support that claim, and report query evaluation throughput rates relative to other block-based mechanisms.
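To make the underlying coding step concrete, the following is a minimal sketch of a generic range-variant ANS (rANS) coder that renormalizes one byte at a time, applied to a single block of small integers. It is not the coder described in this paper: the constants `SCALE_BITS` and `L`, the helper names `build_model`, `encode`, and `decode`, and the choice to build the frequency table from the block itself are all assumptions made for illustration. In a context-based scheme of the kind summarized above, the table would instead be selected from a set of shared models according to the block's aggregate parameters.

```python
from collections import Counter

SCALE_BITS = 12          # assumed: frequencies are scaled to sum to 2**12
M = 1 << SCALE_BITS
L = 1 << 23              # assumed: the coder state is kept in [L, 256 * L)


def build_model(symbols):
    """Scale symbol counts so they sum to M, then lay out cumulative slots."""
    counts = Counter(symbols)
    total = sum(counts.values())
    freqs = {s: max(1, c * M // total) for s, c in counts.items()}
    # Hand the rounding slack to the most frequent symbol.
    top = max(freqs, key=freqs.get)
    freqs[top] += M - sum(freqs.values())
    assert freqs[top] > 0, "too many distinct symbols for this simple scaling"
    starts, slots, acc = {}, [], 0
    for s in sorted(freqs):
        starts[s] = acc
        slots.extend([s] * freqs[s])   # slot -> symbol lookup for decoding
        acc += freqs[s]
    return freqs, starts, slots


def encode(symbols, freqs, starts):
    """Encode in reverse order so that decoding yields the original order."""
    x, out = L, bytearray()
    for s in reversed(symbols):
        f = freqs[s]
        x_max = ((L >> SCALE_BITS) << 8) * f
        while x >= x_max:              # byte-wise renormalization
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * M + (x % f) + starts[s]
    out.extend(x.to_bytes(4, "little"))  # flush the final state
    return bytes(out)


def decode(data, n, freqs, starts, slots):
    """Decode n symbols, consuming the byte stream from its end."""
    pos = len(data) - 4
    x = int.from_bytes(data[pos:], "little")
    result = []
    for _ in range(n):
        slot = x & (M - 1)
        s = slots[slot]
        result.append(s)
        x = freqs[s] * (x >> SCALE_BITS) + slot - starts[s]
        while x < L:                   # pull bytes back in, last written first
            pos -= 1
            x = (x << 8) | data[pos]
    return result


# Round-trip one hypothetical block of document gaps.
gaps = [1, 1, 2, 1, 3, 1, 1, 5, 2, 1, 1, 1, 8, 1, 2, 1] * 8
freqs, starts, slots = build_model(gaps)
payload = encode(gaps, freqs, starts)
assert decode(payload, len(gaps), freqs, starts, slots) == gaps
```

Because the state is always flushed and refilled in whole bytes, the encoded block occupies an integral number of bytes, which is one reason this style of renormalization suits block-based index layouts.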
