REMcode: relocating embedded code for improving system efficiency

The memory hierarchy subsystem has a significant impact on performance and energy consumption of an embedded system. Methods which increase the hit ratio of the cache hierarchy will typically enhance the performance and reduce the embedded system's total energy consumption. This is mainly due to reduced cache-to-memory bus transactions, fewer main memory accesses and fewer processor waiting cycles. A heuristic approach is presented to reduce the total number of cache misses by carefully relocating selected sections of the application's software code within the main memory, thus reducing conflict misses resulting from the cache hierarchy. The method requires no hardware modifications i.e. it is a software-only approach. For the first time such a method is applied to large program traces, and the miss rates and corresponding energy savings are observed while varying cache size, line size and associativity. Relocating the code consistently produces superior performance on direct-mapped cache. Since direct-mapped caches, being smaller in silicon area than caches with higher associativity (for the same size), cost less in terms of energy/access, and access faster, using direct-mapped instruction cache with code relocation for performance-oriented embedded systems is recommended. A maximum cache miss rate reduction from 71% down to less than 1% is achieved, with energy reductions of up to 63% with only a small increase in main memory size.

[1]  Rajesh K. Gupta,et al.  Power savings in embedded processors through decode filter cache , 2002, Proceedings 2002 Design, Automation and Test in Europe Conference and Exhibition.

[2]  Ibrahim N. Hajj,et al.  Architectural and compiler support for energy reduction in the memory hierarchy of high performance microprocessors , 1998, Proceedings. 1998 International Symposium on Low Power Electronics and Design (IEEE Cat. No.98TH8379).

[3]  Jay K. Strosnider,et al.  SMART (strategic memory allocation for real-time) cache design using the MIPS R3000 , 1990, [1990] Proceedings 11th Real-Time Systems Symposium.

[4]  David R. Karger,et al.  Near-optimal intraprocedural branch alignment , 1997, PLDI '97.

[5]  L. Benini,et al.  Cycle-accurate simulation of energy consumption in embedded systems , 1999, Proceedings 1999 Design Automation Conference (Cat. No. 99CH36361).

[6]  James R. Larus,et al.  Wisconsin Architectural Research Tool Set , 1993, CARN.

[7]  Frank Vahid,et al.  Synthesis of customized loop caches for core-based embedded systems , 2002, IEEE/ACM International Conference on Computer Aided Design, 2002. ICCAD 2002..

[8]  Miodrag Potkonjak,et al.  Power optimization of variable-voltage core-based systems , 1999, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[9]  Frank Vahid,et al.  A hybrid approach for core-based system-level power modeling , 2000, ASP-DAC '00.

[10]  Weiyu Tang,et al.  Reducing power with an L0 instruction cache using history-based prediction , 2002, International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems.

[11]  Scott A. Mahlke,et al.  IMPACT: an architectural framework for multiple-instruction-issue processors , 1991, ISCA '91.

[12]  Wen-mei W. Hwu,et al.  Achieving High Instruction Cache Performance With An Optimizing Compiler , 1989, The 16th Annual International Symposium on Computer Architecture.

[13]  Niraj K. Jha,et al.  COSYN: hardware-software co-synthesis of embedded systems , 1997, DAC.

[14]  Fred C. Chow,et al.  A portable machine-independent global optimizer--design and measurements , 1984 .

[15]  Nikil D. Dutt,et al.  Efficient utilization of scratch-pad memory in embedded processor applications , 1997, Proceedings European Design and Test Conference. ED & TC 97.

[16]  Peter Marwedel,et al.  Scratchpad memory: a design alternative for cache on-chip memory in embedded systems , 2002, Proceedings of the Tenth International Symposium on Hardware/Software Codesign. CODES 2002 (IEEE Cat. No.02TH8627).

[17]  Scott A. Mahlke,et al.  IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors , 1998, 25 Years ISCA: Retrospectives and Reprints.

[18]  Kaushik Roy,et al.  Reducing leakage in a high-performance deep-submicron instruction cache , 2001, IEEE Trans. Very Large Scale Integr. Syst..

[19]  Anand Raghunathan,et al.  Efficient power co-estimation techniques for system-on-chip design , 2000, Proceedings Design, Automation and Test in Europe Conference and Exhibition 2000 (Cat. No. PR00537).

[20]  Scott McFarling,et al.  Program optimization for instruction caches , 1989, ASPLOS III.

[21]  Frank Vahid,et al.  Energy benefits of a configurable line size cache for embedded systems , 2003, IEEE Computer Society Annual Symposium on VLSI, 2003. Proceedings..

[22]  Frank Vahid,et al.  A power-configurable bus for embedded systems , 2002, 2002 IEEE International Symposium on Circuits and Systems. Proceedings (Cat. No.02CH37353).

[23]  Sharad Malik,et al.  Performance estimation of embedded software with instruction cache modeling , 1999, TODE.

[24]  Miodrag Potkonjak,et al.  Application-driven synthesis of memory-intensive systems-on-chip , 1999, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[25]  Sharad Malik,et al.  Performance estimation of embedded software with instruction cache modeling , 1995, ICCAD.

[26]  Hiroyuki Tomiyama,et al.  Code placement techniques for cache miss rate reduction , 1997, TODE.

[27]  Jörg Henkel,et al.  I-CoPES: fast instruction code placement for embedded systems to improve performance and energy efficiency , 2001, IEEE/ACM International Conference on Computer Aided Design. ICCAD 2001. IEEE/ACM Digest of Technical Papers (Cat. No.01CH37281).

[28]  W. Wolf,et al.  Hardware/software co-synthesis with memory hierarchies , 1998, 1998 IEEE/ACM International Conference on Computer-Aided Design. Digest of Technical Papers (IEEE Cat. No.98CB36287).

[29]  Miodrag Potkonjak,et al.  Power optimization of variable voltage core-based systems , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).

[30]  Norman P. Jouppi,et al.  Cacti 3. 0: an integrated cache timing, power, and area model , 2001 .

[31]  D. B. Kirk,et al.  SMART (strategic memory allocation for real-time) cache design , 1989, [1989] Proceedings. Real-Time Systems Symposium.

[32]  Scott McFarling,et al.  Procedure merging with instruction caches , 1991, PLDI '91.

[33]  Jörg Henkel,et al.  A framework for estimating and minimizing energy dissipation of embedded HW/SW systems , 2001 .