Software-based instruction caching for embedded processors

While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed and explicitly managed by software. Compared to hardware caches of the same data capacity, they are smaller, have shorter access times, and consume less energy per access. Their access times are also easier to predict, since there is no possibility of a "miss." On the other hand, they are more difficult for the programmer to use because they are not automatically managed. In this paper, we present a software system that allows all or part of an SRAM or scratchpad memory to be automatically managed as a cache. This system provides the programming convenience of a cache for processors that lack dedicated caching hardware. It has been implemented for an actual processor and runs on real hardware. Our results show that a software-based instruction cache can perform within 10% of a traditional hardware cache on many benchmarks while using a cheaper, simpler SRAM. On these same benchmarks, energy consumption is up to 3% lower than it would be with a hardware cache.
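As a rough illustration of what "managing a scratchpad as a cache in software" can mean, the following toy Python model keeps a tag per scratchpad slot and runs a software miss handler that copies a block in from slower memory. It is only a sketch under stated assumptions: the direct-mapped organization, the block and slot sizes, and every name in it (`SoftwareICache`, `fetch`, `BLOCK_WORDS`, `NUM_SLOTS`) are hypothetical and are not taken from the paper's system.

```python
# Toy model: part of a scratchpad is managed in software as a direct-mapped
# instruction cache. All names and sizes here are illustrative assumptions.

BLOCK_WORDS = 4   # words per cached block (assumed)
NUM_SLOTS = 8     # scratchpad slots reserved for the cache (assumed)


class SoftwareICache:
    def __init__(self, backing_memory):
        self.backing = backing_memory        # model of slow off-chip memory
        self.tags = [None] * NUM_SLOTS       # block number held by each slot
        self.scratchpad = [[0] * BLOCK_WORDS for _ in range(NUM_SLOTS)]
        self.hits = 0
        self.misses = 0

    def fetch(self, addr):
        """Return the word at word-address `addr`, filling the slot on a miss."""
        block = addr // BLOCK_WORDS
        slot = block % NUM_SLOTS
        if self.tags[slot] == block:
            self.hits += 1
        else:
            # Software miss handler: copy the whole block into the scratchpad
            # and record which block the slot now holds.
            self.misses += 1
            start = block * BLOCK_WORDS
            self.scratchpad[slot] = self.backing[start:start + BLOCK_WORDS]
            self.tags[slot] = block
        return self.scratchpad[slot][addr % BLOCK_WORDS]


memory = list(range(100, 164))   # 64 words standing in for instructions
cache = SoftwareICache(memory)
words = [cache.fetch(a) for a in (0, 1, 2, 3, 0, 32, 0)]
```

In this access pattern, addresses 0 and 32 map to the same slot, so the final fetch of address 0 misses again. The cost of such a miss is paid in software, which is one reason a software-managed cache can trail a hardware cache in performance.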
