Compact inverted index storage using general‐purpose compression libraries

Efficient storage of large inverted indexes is one of the key technologies that support current web search services. Here we re‐examine mechanisms for representing document‐level inverted indexes and within‐document term frequencies, and compare specialized methods developed for this task against recent fast implementations of general‐purpose adaptive compression techniques. Experiments with the Gov2‐URL collection and a large collection of crawled news stories show that standard compression libraries can provide compression effectiveness as good as or better than previous methods, with decoding rates only moderately slower than reference implementations of those tailored approaches. This surprising outcome means that high‐performance index compression can be achieved without specialized implementations.
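The general idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a postings list of strictly increasing document ids, converts it to gaps, serializes the gaps with a simple byte-aligned code, and passes the resulting bytes to a general-purpose compressor. Python's standard `zlib` module stands in for the compression library purely as an assumption, since the abstract does not name the libraries that were tested, and all function names here are hypothetical.

```python
import zlib


def to_gaps(doc_ids):
    """Convert a strictly increasing list of document ids to gaps (d-gaps)."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking a running sum."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(values):
    """Byte-aligned code: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append(v & 0x7F)
            v >>= 7
        out.append(v | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    values, cur, shift = [], 0, 0
    for b in data:
        if b & 0x80:  # final byte of this value
            values.append(cur | ((b & 0x7F) << shift))
            cur, shift = 0, 0
        else:
            cur |= b << shift
            shift += 7
    return values


def compress_postings(doc_ids, level=9):
    """Gaps -> byte code -> general-purpose compressor (zlib is an assumption)."""
    return zlib.compress(vbyte_encode(to_gaps(doc_ids)), level)


def decompress_postings(blob):
    """Reverse of compress_postings."""
    return from_gaps(vbyte_decode(zlib.decompress(blob)))
```

Calling `decompress_postings(compress_postings(ids))` should return the original list. How much the general-purpose stage gains over the byte code alone depends on how regular the gaps are, so any comparison on synthetic data says little about real collections.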
