Integer Set Compression and Statistical Modeling

Compression of integer sets and sequences has been studied extensively for settings where elements follow a uniform probability distribution. In addition, methods exist that exploit clustering of elements to achieve higher compression performance. In this work, we address the case where the enumeration of elements may be arbitrary or random, but where statistics are kept in order to estimate the probabilities of elements. We present a recursive subset-size encoding method that can benefit from such statistics, explore the effects of permuting the enumeration order based on element probabilities, and discuss general properties and possibilities for this class of compression problems.
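
To make the idea of recursive subset-size encoding concrete, the following is a minimal sketch of the uniform baseline only, not the method presented in this work. It assumes the set size is already known to the decoder, splits the universe in half, and charges the ideal cost of stating how many elements fall into the left half, using a hypergeometric probability. The function name `subset_size_cost` and the halving rule are illustrative assumptions. Under a uniform distribution the per-split costs add up to log2 C(n, k), the information-theoretic minimum for a k-subset of an n-element universe; a statistics-aware coder would replace the hypergeometric model with one driven by estimated element probabilities.

```python
from math import comb, log2

def subset_size_cost(elements, lo, hi):
    """Ideal code length in bits for the sorted list `elements`, all in
    [lo, hi), when len(elements) is already known to the decoder.

    Assumption: every k-subset of the range is equally likely, so the
    number of elements in the left half is hypergeometrically distributed.
    """
    n = hi - lo
    k = len(elements)
    # Nothing left to encode: the range is trivial, empty, or completely full.
    if n <= 1 or k == 0 or k == n:
        return 0.0
    mid = lo + n // 2
    left = [e for e in elements if e < mid]
    right = [e for e in elements if e >= mid]
    n_left, k_left = mid - lo, len(left)
    # Probability that exactly k_left of the k elements land in the left half.
    p = comb(n_left, k_left) * comb(n - n_left, k - k_left) / comb(n, k)
    return (-log2(p)
            + subset_size_cost(left, lo, mid)
            + subset_size_cost(right, mid, hi))

if __name__ == "__main__":
    sample = [1, 4, 5, 9, 14]            # a 5-element subset of {0, ..., 15}
    print(subset_size_cost(sample, 0, 16))
    print(log2(comb(16, 5)))             # uniform lower bound, for comparison
```

The two printed values agree up to floating-point rounding, because the hypergeometric factors telescope down the recursion to 1 / C(n, k).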