Efficient Inverted Index Compression Algorithm Characterized by Faster Decompression Compared with the Golomb-Rice Algorithm

This article deals with the compression of binary sequences containing a given number of ones, which can equivalently be viewed as lists of indexes of a given length. The first part of the article shows that the entropy H of random n-element binary sequences with exactly k elements equal to one satisfies the inequalities k·log2(0.48·n/k) < H < k·log2(2.72·n/k). Based on this result, we propose a simple coding that uses fixed-length words. Its main application is the compression of random binary sequences in which the number of zeros greatly exceeds the number of ones. Importantly, the proposed solution allows much faster decompression than Golomb-Rice coding, at the cost of a relatively small decrease in compression efficiency. The proposed algorithm can be particularly useful in database applications where decompression speed matters much more than the degree of index list compression.
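
The entropy bounds stated above can be checked numerically. The following is a minimal sketch, assuming the sequences are drawn uniformly from all n-element binary sequences with exactly k ones, so that H = log2 C(n, k). The function names `entropy_bits` and `bounds` are illustrative and do not come from the article.

```python
import math

def entropy_bits(n, k):
    """Entropy in bits of a uniformly random n-bit sequence with exactly k ones.

    Assumption: uniform distribution, so H = log2 C(n, k). The log-gamma
    function is used to avoid building the huge binomial coefficient.
    """
    ln_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln_binom / math.log(2)

def bounds(n, k):
    """Lower and upper bounds k*log2(0.48*n/k) and k*log2(2.72*n/k)."""
    lower = k * math.log2(0.48 * n / k)
    upper = k * math.log2(2.72 * n / k)
    return lower, upper

if __name__ == "__main__":
    # Sparse cases: far more zeros than ones.
    for n, k in [(1000, 10), (10**6, 100), (10**6, 10**4)]:
        lower, upper = bounds(n, k)
        h = entropy_bits(n, k)
        print(f"n={n}, k={k}: {lower:.1f} < H={h:.1f} < {upper:.1f}",
              "holds" if lower < h < upper else "violated")
```

Dividing each of the three quantities by k gives the per-index cost in bits, which is the figure a fixed-length code word has to be compared against.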
