Statistical models for term compression

Summary form only given. Computing systems frequently deal with symbolic tree data structures, which are also known as terms in universal algebra and logic. Our goal is to develop universal, effective and efficient term compression techniques superior to the specialized and universal compression techniques currently available. Our approach is to use knowledge of term structure to build accurate universal statistical models of terms. These models can compress terms faster or more effectively than comparable sequential methods. We present two statistical term models that are related to Markov random fields over trees. These models gather statistical information about parent-child symbol relationships in terms. Huffman or arithmetic codes generated from these probability estimates are used to encode the terms. In the first model, a symbol's value is predicted by the value of its parent symbol alone. Thus, in compressing a subterm of the form t+u, the + operator would be used to select a specialized code for the root symbols of subterms t and u. The second model also uses the symbol's argument position as a predictor. For example, in compressing t+u, we would make different probability estimates for the first and second arguments, and use one code to encode the first children of + and another code for the second children. It might be the case that t+1 (but not 1+t) occurs frequently for many different terms t, in which case we could give 1 a shorter code as the second child of +. We have not achieved our goal of improved term compression, but we believe that more sophisticated and more effective techniques remain to be investigated. For example, one improvement would be to make a term version of PPM in which contexts are ancestor symbols in the term rather than predecessors in a sequence. Our implementation could be viewed as a first step in that direction.
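The two context models described above can be illustrated with a small sketch. This is not the paper's implementation; it assumes a simple tuple encoding of terms (first element the symbol, remaining elements the children) and estimates the arithmetic-coding cost of each model as the sum of -log2 of each symbol's conditional probability in its context:

```python
from collections import Counter
from math import log2

# A term is a (symbol, child, child, ...) tuple; leaves are 1-tuples.
def walk(term, parent=None, argpos=None):
    """Yield (parent_symbol, argument_position, symbol) for every node."""
    sym = term[0]
    yield (parent, argpos, sym)
    for i, child in enumerate(term[1:]):
        yield from walk(child, sym, i)

def model_cost(terms, use_argpos):
    """Estimated bits to encode all symbols under a context model:
    parent symbol alone, or parent symbol plus argument position."""
    counts = Counter()   # (context, symbol) -> frequency
    totals = Counter()   # context -> frequency
    triples = [t for term in terms for t in walk(term)]
    for parent, pos, sym in triples:
        ctx = (parent, pos) if use_argpos else parent
        counts[(ctx, sym)] += 1
        totals[ctx] += 1
    # Ideal (arithmetic-coding) cost: -log2 of each symbol's empirical
    # conditional probability in its context, summed over the corpus.
    cost = 0.0
    for parent, pos, sym in triples:
        ctx = (parent, pos) if use_argpos else parent
        cost += -log2(counts[(ctx, sym)] / totals[ctx])
    return cost

# Toy corpus where "1" is common specifically as the second child of "+":
corpus = [("+", ("t",), ("1",)),
          ("+", ("u",), ("1",)),
          ("+", ("1",), ("v",))]
print(model_cost(corpus, use_argpos=False))
print(model_cost(corpus, use_argpos=True))
```

On this corpus the second model assigns a smaller total cost, because conditioning on argument position lets "1" take a short code as the second child of + without shortening its code in first-child position, matching the t+1 example in the abstract.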
