Automatic inference of models for statistical code compression

This paper describes experiments that apply machine learning to compress computer programs, formalizing and automating decisions about instruction encoding that have traditionally been made by humans in a more ad hoc manner. A program accepts a large training set of program material in a conventional compiler intermediate representation (IR) and automatically infers a decision tree that separates IR code into streams that compress much better than the undifferentiated whole. Driving a conventional arithmetic compressor with this model yields code 30% smaller than the previous record for IR code compression, and 24% smaller than an ambitious optimizing compiler feeding an ambitious general-purpose data compressor.
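The effect of separating code into streams can be illustrated with a minimal sketch. It is not the paper's algorithm: the paper infers a decision tree from training data, whereas this sketch fixes one hand-picked predictor (the preceding operator) and compares the ideal zero-order coding cost of the undifferentiated sequence with the summed cost of the per-context streams. The opcode names, the toy sequence, and the helper names `cost_bits` and `split_by_context` are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def cost_bits(symbols):
    """Ideal zero-order coding cost in bits: -sum(count * log2(count / n))."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())


def split_by_context(tokens, context_of):
    """Route each token to a stream chosen by a function of the previous token."""
    streams = defaultdict(list)
    prev = None
    for tok in tokens:
        streams[context_of(prev)].append(tok)
        prev = tok
    return streams


# Hypothetical IR operator sequence (made-up names, not real training data).
tokens = ["ADDRLP", "INDIRI", "CNSTI", "ADDI", "ADDRLP", "ASGNI"] * 50

# Cost of coding everything with a single frequency table.
whole = cost_bits(tokens)

# Cost of coding one stream per preceding operator, each with its own table.
streams = split_by_context(tokens, lambda prev: prev)
split = sum(cost_bits(s) for s in streams.values())

print(f"undifferentiated: {whole:.1f} bits")
print(f"split by previous operator: {split:.1f} bits")
```

The comparison ignores the cost of transmitting the model itself and uses a static frequency count rather than an adaptive arithmetic coder, so it shows only why well-chosen streams can be cheaper to code, not how much a real compressor would save.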
