Paper information - Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions

Embedded processors are now expected to run complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors. As a result, the impact of memory latency on execution time is increasingly evident. At the same time, embedded processors are expected to be power-conscious and to occupy minimal area. Traditional methods for addressing performance and memory latency, such as multiple issue, out-of-order execution, and large, associative caches, are therefore poorly suited to the embedded domain because of their significant area and power overhead. This paper explores a novel approach to mitigating execution delays caused by memory latency, one that would otherwise not be possible in a regular in-order, single-issue embedded processor without large, power-hungry constructs such as a Reorder Buffer (ROB). The concept relies on both compile-time and run-time information to safely allow non-data-dependent instructions to continue executing while a memory stall is in progress. Simulation results show a significant improvement in execution throughput of approximately 11%, with minimal impact on area and power.

Alex Orailoglu | Garo Bournoutian
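
The abstract does not say how the compile-time information is encoded or what hardware consumes it, so the following is only a toy cycle-count model of the general idea: an in-order, single-issue core that keeps issuing instructions during a cache miss until it reaches the first instruction that touches the pending load's destination register. Every name here (`Instr`, `cycles_blocking`, `cycles_with_overlap`, the register names, the miss penalty) is hypothetical and chosen for illustration; it is not the authors' mechanism, and the numbers it produces are unrelated to the reported 11%.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, any number of sources."""
    dest: Optional[str]
    srcs: Tuple[str, ...]
    is_load_miss: bool = False  # assumption: misses are known up front in this toy model


def cycles_blocking(program, miss_penalty):
    """Baseline: every instruction takes 1 cycle; a miss stalls the core for the full penalty."""
    return sum(1 + (miss_penalty if ins.is_load_miss else 0) for ins in program)


def cycles_with_overlap(program, miss_penalty):
    """Keep issuing, in order, instructions that do not touch the pending load's register.

    Simplifying assumption: at most one miss is outstanding, so a second
    missing load (or any read/write of the pending register) waits for the data.
    """
    cycle = 0
    pending_reg = None  # destination register of the outstanding miss, if any
    ready_at = 0        # cycle at which the missing data arrives
    for ins in program:
        if pending_reg is not None and cycle >= ready_at:
            pending_reg = None  # the miss resolved while independent work was running
        if pending_reg is not None:
            touches = (pending_reg in ins.srcs
                       or ins.dest == pending_reg
                       or ins.is_load_miss)
            if touches:
                cycle = ready_at  # stall here until the data arrives
                pending_reg = None
        cycle += 1  # issue this instruction
        if ins.is_load_miss:
            pending_reg = ins.dest
            ready_at = cycle + miss_penalty
    if pending_reg is not None:
        cycle = max(cycle, ready_at)  # drain a miss still outstanding at the end
    return cycle


program = [
    Instr("r1", ("r0",), is_load_miss=True),  # load that misses in the cache
    Instr("r2", ("r3", "r4")),                # independent of r1
    Instr("r5", ("r2",)),                     # independent of r1
    Instr("r6", ("r1", "r5")),                # first consumer of the loaded value
    Instr("r7", ("r6",)),
]

print(cycles_blocking(program, 10))      # 15
print(cycles_with_overlap(program, 10))  # 13
```

In this example the two independent instructions execute under the shadow of the 10-cycle miss, so the overlapped model finishes two cycles earlier than the blocking baseline. Because issue stays in order, the model stops at the first dependent instruction and needs no reorder buffer, which is the property the abstract emphasizes.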
