VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming

Encoding lists of integers efficiently is important for many applications in different fields. Adjacency lists of large graphs are usually encoded to save space and to improve decoding speed. Inverted indexes of Information Retrieval (IR) systems keep their lists of postings compressed in order to exploit the memory hierarchy. Secondary indexes of DBMSs are stored similarly to inverted indexes in IR systems. In this paper we propose Vector of Splits Encoding (VSEncoding), a novel class of encoders that work by optimally partitioning a list of integers into blocks, each of which is compressed efficiently with a simple encoder. Previous works applied heuristics during the partitioning step; we instead find the optimal solution with a dynamic programming approach. Experiments show that our class of encoders outperforms all existing methods in the literature by more than 10% (with the exception of Binary Interpolative Coding, with which it roughly ties), while still retaining a very fast decompression algorithm.
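The partitioning step described above can be illustrated with a minimal sketch. It is not the paper's implementation. The cost model is an assumption chosen for illustration: each block pays a fixed header (`header_bits`) and stores every element with the bit width of its largest value. The names `bit_width`, `optimal_partition`, `header_bits`, and `max_block` are hypothetical.

```python
def bit_width(x):
    # Bits needed for a non-negative integer; at least 1 so that 0 is storable.
    return max(1, x.bit_length())


def optimal_partition(values, header_bits=8, max_block=16):
    """Split `values` into contiguous blocks of minimum total cost.

    Assumed cost of a block values[j:i]:
        header_bits + (i - j) * bit_width(max(values[j:i]))
    Returns (total_bits, [(start, end), ...]) with half-open block ranges.
    """
    n = len(values)
    best = [0] + [float("inf")] * n   # best[i]: cheapest encoding of values[:i]
    prev = [0] * (n + 1)              # prev[i]: start of the last block in that encoding

    for i in range(1, n + 1):
        width = 0
        # Grow the candidate last block values[j:i] leftwards,
        # tracking the width of its largest element as we go.
        for j in range(i - 1, max(0, i - max_block) - 1, -1):
            width = max(width, bit_width(values[j]))
            cost = best[j] + header_bits + (i - j) * width
            if cost < best[i]:
                best[i] = cost
                prev[i] = j

    # Walk the back-pointers to recover the block boundaries.
    blocks = []
    i = n
    while i > 0:
        blocks.append((prev[i], i))
        i = prev[i]
    blocks.reverse()
    return best[n], blocks


if __name__ == "__main__":
    gaps = [1, 1, 1, 1, 100, 1, 1, 1, 1]
    total, blocks = optimal_partition(gaps, header_bits=4)
    print(total, blocks)
```

Under this assumed cost model, the single large value is isolated in its own block, so the small values around it are not forced to use its bit width. Bounding the block length by `max_block` keeps the running time at O(n · max_block) rather than quadratic in the list length.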
