Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions

Embedded processors are now expected to run complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors, so the impact of memory latency on execution time is increasingly evident. At the same time, embedded processors are expected to be power-conscious and to add minimal area. Traditional techniques for improving performance and hiding memory latency, such as multiple issue, out-of-order execution, and large associative caches, are therefore poorly suited to the embedded domain because of their significant area and power overhead. This paper explores a novel approach to mitigating the execution delays caused by memory latency in an in-order, single-issue embedded processor, without resorting to large, power-hungry structures such as a reorder buffer (ROB). The technique combines compile-time and run-time information to safely allow instructions that are guaranteed not to depend on the missing data to continue executing while a memory stall is in progress. Simulation results show an improvement in execution throughput of approximately 11%, with minimal impact on area and power.
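To make the idea concrete, the sketch below (in C, with entirely hypothetical structure names, fields, and encodings; the paper's actual compiler/hardware interface may differ) models an in-order issue loop that combines a compile-time "guaranteed independent" hint with a run-time register check: when a load misses, younger instructions keep issuing only if they are hinted independent and do not read or overwrite the register the miss will write. This is a minimal illustration of the concept, not the paper's implementation.

```c
/* Conceptual sketch only -- not the paper's actual mechanism or ISA.
 * Models issuing past a load miss when both a compile-time independence
 * hint and a run-time register check say it is safe to do so.          */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  dest;        /* destination register, -1 if none               */
    int  src1, src2;  /* source registers, -1 if unused                 */
    bool is_load;     /* memory load?                                   */
    bool indep_hint;  /* compiler hint: independent of preceding loads  */
} Instr;

/* Run-time safety check: would this instruction read or overwrite the
 * register that the outstanding miss will eventually write?            */
static bool conflicts_with_miss(const Instr *i, int miss_dest)
{
    return i->src1 == miss_dest || i->src2 == miss_dest || i->dest == miss_dest;
}

/* In-order issue with "execute past the miss" behaviour.               */
static void issue(const Instr *code, int n, const bool *hits)
{
    int miss_dest = -1;                  /* dest reg of outstanding miss */

    for (int i = 0; i < n; i++) {
        if (miss_dest >= 0 &&
            (!code[i].indep_hint || conflicts_with_miss(&code[i], miss_dest))) {
            printf("instr %d: stall, wait for miss to return\n", i);
            miss_dest = -1;              /* miss data arrives, resume    */
        }
        if (code[i].is_load && !hits[i]) {
            printf("instr %d: load misses, keep issuing independents\n", i);
            miss_dest = code[i].dest;
        } else {
            printf("instr %d: issued\n", i);
        }
    }
}

int main(void)
{
    /* load r1; add r2<-r3,r4 (hinted independent); add r5<-r1,r2 (dependent) */
    Instr code[] = {
        { .dest = 1, .src1 = 6, .src2 = -1, .is_load = true,  .indep_hint = false },
        { .dest = 2, .src1 = 3, .src2 = 4,  .is_load = false, .indep_hint = true  },
        { .dest = 5, .src1 = 1, .src2 = 2,  .is_load = false, .indep_hint = true  },
    };
    bool hits[] = { false, true, true };  /* the first load misses */

    issue(code, 3, hits);
    return 0;
}
```

In this toy trace, the second instruction issues under the shadow of the miss, while the third stalls because it reads the miss's destination register; a real design would also bound how far execution may run ahead and preserve precise exception behaviour.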
