Efficient execution of compressed programs

Code compression applies data compression to reduce program memory size in memory-limited embedded computers. For system-on-a-chip designs, this reduces the system die area, which lowers die cost. After compilation, the binary (native code) program is compressed and stored in the embedded system. At run time, the compressed program is incrementally decompressed and executed. Although compressed programs have better code density, their performance is typically lower because additional effort is required to decompress the instruction stream. This dissertation presents methods to improve the performance of compressed programs. Decompression overhead can be minimized by using special-purpose hardware. This dissertation analyzes IBM's CodePack decompression algorithm and proposes optimizations for it. The optimized decompressor can often execute compressed programs faster than the original native program, because the performance benefit of using fewer memory transactions to fetch compressed instructions surpasses the small decompression overhead. In these cases, code compression improves performance as well as code density.

The decompression hardware can be largely replaced with software. The benefits of software decompression are greater design flexibility, reduced hardware complexity, reduced die area, and reduced cost. However, software decompression is much slower than hardware decompression. On a 5-stage pipelined embedded processor with a 4 KB instruction cache, software-decompressed CodePack programs execute 1.3 to 27.0 times slower than native programs and reduce program memory die area (instruction cache and main memory) by 26% to 41%. This dissertation proposes instruction set support to enable efficient software-managed decompression. In addition, it explores two software optimizations, hybrid programs and memoization, that improve the execution time of compressed programs in exchange for some compression. Hybrid programs contain both native and compressed code to reduce the number of times the decompressor is invoked. Memoization is a dynamic optimization that caches recent decompression results, which also avoids invoking the decompressor. Optimized compressed programs that reduce die area by 10% to 33% execute only 1.00 to 1.22 times slower than native code. In addition, loop-oriented (multimedia) programs are nearly as fast as native code.
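As an informal illustration of how hybrid programs and memoization each reduce decompressor invocations, the following Python sketch models instruction fetch at line granularity. It is a toy model under stated assumptions, not the dissertation's mechanism: `zlib` stands in for CodePack, and the 32-byte line size, the `CompressedProgram` class, the `fetch_line` method, and the least-recently-used memo table are all hypothetical names and choices introduced here.

```python
import zlib
from collections import OrderedDict

LINE_BYTES = 32  # assumed line size, chosen only for illustration


class CompressedProgram:
    """Toy hybrid program: some lines stay native, the rest are compressed."""

    def __init__(self, image, native_lines=(), memo_entries=4):
        native = set(native_lines)
        self.lines = {}
        for offset in range(0, len(image), LINE_BYTES):
            idx = offset // LINE_BYTES
            chunk = image[offset:offset + LINE_BYTES]
            if idx in native:
                self.lines[idx] = (False, chunk)  # stored uncompressed
            else:
                # zlib is a stand-in for the real code compressor
                self.lines[idx] = (True, zlib.compress(chunk))
        self.memo = OrderedDict()        # recent decompression results
        self.memo_entries = memo_entries
        self.decompressor_calls = 0

    def fetch_line(self, idx):
        """Return the native bytes of line idx, decompressing only if needed."""
        is_compressed, payload = self.lines[idx]
        if not is_compressed:
            return payload               # hybrid: native code skips the decompressor
        if idx in self.memo:
            self.memo.move_to_end(idx)   # memoization hit: reuse the earlier result
            return self.memo[idx]
        self.decompressor_calls += 1     # miss: invoke the software decompressor
        line = zlib.decompress(payload)
        self.memo[idx] = line
        if len(self.memo) > self.memo_entries:
            self.memo.popitem(last=False)  # evict the least recently used entry
        return line


def demo():
    """Run a three-line loop ten times and count decompressor invocations."""
    image = bytes(range(256))            # 8 lines of placeholder "code"
    prog = CompressedProgram(image, native_lines={0}, memo_entries=4)
    for _ in range(10):
        for idx in (0, 1, 2):
            prog.fetch_line(idx)
    return prog.decompressor_calls
```

In this model, `demo()` invokes the decompressor twice for 30 line fetches: line 0 is native, and lines 1 and 2 are decompressed once and then served from the memo table. A loop whose compressed lines outnumber the memo entries would decompress repeatedly, which is the case where keeping hot code native matters most.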
