A Smart Cache Designed for Embedded Applications

In this paper, we extend our previous investigation of split array and scalar data caches to embedded systems. More specifically, we explore reconfigurable data caches in which the L-1 data cache is optimally partitioned into a scalar cache augmented with a victim cache and a separate array cache. We do not change cache block sizes or set-associativities, which makes it easier to reconfigure cache banks. We also evaluate how any unused portions of the cache resources can be used as prefetch buffers and branch target buffers to further improve application performance. Since embedded systems require very careful management of available resources, our approach to configuring L-1 caches can lead to better performance and greater energy savings.

I. INTRODUCTION

For embedded applications, it is necessary to provide the required performance within specified size and power budgets. Studies have found that the on-chip cache is responsible for 50% of the power consumed by an embedded processor (17). Therefore, it is worthwhile to investigate new cache organizations that address both the performance and the power requirements of embedded applications. In this paper we explore how to design reconfigurable caches that achieve high performance for embedded applications while remaining both energy and area efficient.

For the last two decades, computer architects have proposed various cache-control mechanisms and novel cache architectures that detect program access patterns and fine-tune cache policies to improve both overall cache use and data locality for desktop applications. Major cache optimization techniques (which improve miss rate, miss penalty, or both) include increasing block size and cache size, increasing associativity, complementing the regular cache with a victim cache, prefetching data, and adding further levels to the cache hierarchy.
Because embedded applications must deliver the required performance within specified size and power budgets, most of these techniques are often not implemented. In our previous work (20), we studied each of these cache-control mechanisms and performed a comprehensive evaluation of our proposed partitioned caches. Our results demonstrated that split caches can outperform all of these conventional cache optimization techniques.

In this paper we adapt and further extend these studies for embedded systems, with the primary goal of saving energy while maintaining execution performance and using significantly smaller data caches. In addition to partitioning data caches into array (or stream) and scalar caches, we investigate how the split caches can be optimally reconfigured for each application. Our studies show significant savings in power and cache capacity. By spending the saved area and power on other architectural features that implement different cache optimization techniques, additional performance gains can be achieved for embedded applications.

We assume that caches can be designed to permit reconfigurability (10). Previous studies investigated configuring block sizes and set-associativities. In this paper we explore configuring caches only by changing cache sizes, without changing associativity or block sizes. Reconfigurability is achieved with a configuration vector that can be loaded with a new configuration before an application starts executing. The optimal cache sizes are found off-line by searching through the possible configurations. Our studies show that, for an L-1 cache system, reconfigurable caches consisting of an instruction cache with prefetching and split data caches (a scalar data cache augmented with a victim cache, and a separate array data cache) are effective for embedded systems.
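
The off-line exploration of cache sizes described above can be pictured with a small sketch. It is illustrative only and is not the procedure used in this work: it assumes the L-1 data cache is built from equal-size banks, that the configuration vector simply records how many banks go to the scalar, array, and victim partitions, and that a cost function (here a made-up `toy_cost`, standing in for a simulator run) scores each candidate. All names (`CacheConfig`, `enumerate_configs`, `best_config`, `toy_cost`) and all numbers are hypothetical.

```python
from typing import Callable, Iterator, NamedTuple


class CacheConfig(NamedTuple):
    """Hypothetical configuration vector: banks given to each L-1 partition."""
    scalar_banks: int
    array_banks: int
    victim_banks: int

    def used(self) -> int:
        return self.scalar_banks + self.array_banks + self.victim_banks


def enumerate_configs(total_banks: int) -> Iterator[CacheConfig]:
    """Yield every split that fits in `total_banks`.

    Scalar and array caches need at least one bank each; the victim cache
    is optional. Banks left over stay unused (they could, for example,
    serve as prefetch buffers).
    """
    for s in range(1, total_banks + 1):
        for a in range(1, total_banks - s + 1):
            for v in range(0, total_banks - s - a + 1):
                yield CacheConfig(s, a, v)


def best_config(total_banks: int,
                cost: Callable[[CacheConfig], int]) -> CacheConfig:
    """Exhaustive off-line search: lowest cost wins, ties go to fewer banks."""
    return min(enumerate_configs(total_banks),
               key=lambda cfg: (cost(cfg), cfg.used()))


def toy_cost(cfg: CacheConfig) -> int:
    """Invented cost, NOT measured data: misses fall as partitions grow,
    any victim cache removes a fixed number of misses, and every powered
    bank adds a fixed energy charge."""
    misses = 1200 // cfg.scalar_banks + 400 // cfg.array_banks
    if cfg.victim_banks > 0:
        misses -= 100
    return misses + 60 * cfg.used()
```

With eight banks and this invented cost, `best_config(8, toy_cost)` selects four scalar banks, three array banks, and one victim bank. In practice the cost would come from simulating the application under each candidate configuration, and the winning vector would be loaded before the application starts.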
With such an L-1 cache organization for embedded applications, our results show significant reductions in the number of cache misses (which translate into reduced cache access times), in required cache capacity, in power consumption, and in the number of execution cycles. This is primarily because separate caches eliminate conflicts among data types that exhibit divergent access behaviors. Since lower miss rates at L-1 reduce the number of accesses to the L-2 cache, we can reduce the size of the L-2 cache. The saved area can be used for other purposes, or further power reductions can be achieved by partially or completely shutting down the L-2 cache. The energy savings result from the reduced number of cache misses, which in turn reduces the number of trips to higher levels of the memory hierarchy, trips that often cross chip boundaries.
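
The conflict-elimination argument can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the evaluation setup of this work: the caches are direct-mapped with 32-byte blocks, the victim cache is omitted, each access in the synthetic trace is already labeled as "scalar" or "array" (how accesses are classified is outside this sketch), and the names `DirectMappedCache`, `make_trace`, `unified_misses`, and `split_misses` are hypothetical.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache that only counts misses."""

    def __init__(self, num_lines, block_size=32):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = [None] * num_lines  # block number held by each line
        self.misses = 0

    def access(self, addr):
        block = addr // self.block_size
        index = block % self.num_lines
        if self.lines[index] != block:
            self.lines[index] = block  # fill on a miss, evicting the old block
            self.misses += 1


def make_trace(n=256, array_base=4096):
    """Synthetic trace: a streaming array walk interleaved with 4 hot scalars."""
    trace = []
    for i in range(n):
        trace.append(("array", array_base + 4 * i))  # 4-byte elements, each touched once
        trace.append(("scalar", 32 * (i % 4)))       # one scalar per 32-byte block
    return trace


def unified_misses(trace, num_lines=8):
    """All accesses share one cache, so the array stream can evict scalars."""
    cache = DirectMappedCache(num_lines)
    for _, addr in trace:
        cache.access(addr)
    return cache.misses


def split_misses(trace, scalar_lines=4, array_lines=4):
    """Same total capacity, but each access type has its own cache."""
    caches = {"scalar": DirectMappedCache(scalar_lines),
              "array": DirectMappedCache(array_lines)}
    for kind, addr in trace:
        caches[kind].access(addr)
    return caches["scalar"].misses, caches["array"].misses
```

On this trace the split organization incurs only compulsory misses: 4 for the scalars and 32 for the array blocks. The unified cache of the same total size incurs more, because the streaming array blocks repeatedly evict the hot scalars and are evicted by them in turn. Each miss avoided is one fewer access to the next level of the hierarchy, which is where the energy argument above comes from.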

[1]  Kanad Ghose, et al. Energy-efficiency of VLSI caches: a comparative study, 1997, Proceedings Tenth International Conference on VLSI Design.

[2]  Mateo Valero, et al. Software management of selective and dual data caches, 1997.

[3]  Norman P. Jouppi, et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers, 1990, Proceedings of the 17th Annual International Symposium on Computer Architecture.

[4]  Trevor Mudge, et al. MiBench: A free, commercially representative embedded benchmark suite, 2001.

[5]  James E. Smith, et al. The predictability of data values, 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[6]  Frank Vahid, et al. Energy benefits of a configurable line size cache for embedded systems, 2003, Proceedings of the IEEE Computer Society Annual Symposium on VLSI.

[7]  Charles C. Weems, et al. Application-adaptive intelligent cache memory system, 2002, TECS.

[8]  Norman P. Jouppi, et al. CACTI: an enhanced cache access and cycle time model, 1996, IEEE J. Solid State Circuits.

[9]  Norman P. Jouppi, et al. Reconfigurable caches and their application to media processing, 2000, Proceedings of 27th International Symposium on Computer Architecture.

[10]  Frank Vahid, et al. A highly configurable cache architecture for embedded systems, 2003, Proceedings of the 30th Annual International Symposium on Computer Architecture.

[11]  G.S. Sohi, et al. Dynamic instruction reuse, 1997, ISCA '97.

[12]  Frank Vahid, et al. Using a victim buffer in an application-specific memory hierarchy, 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[13]  David H. Albonesi, et al. Selective cache ways: on-demand cache resource allocation, 1999, Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-32).

[14]  Afrin Naz, et al. Tiny split data-caches make big performance impact for embedded applications, 2006, J. Embed. Comput.

[15]  Nikil D. Dutt, et al. Automatic tuning of two-level caches to embedded applications, 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[16]  Todd C. Mowry, et al. Compiler-based prefetching for recursive data structures, 1996, ASPLOS VII.

[17]  Todd M. Austin, et al. The SimpleScalar tool set, version 2.0, 1997, CARN.

[18]  Jean-Loup Baer, et al. An effective on-chip preloading scheme to reduce data access penalty, 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[19]  Israel Koren, et al. The minimax cache: an energy-efficient framework for media processors, 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.