Techniques for efficient processing in runahead execution engines

Runahead execution is a technique that improves processor performance by pre-executing the running application instead of stalling the processor when a long-latency cache miss occurs. Previous research has shown that this technique significantly improves processor performance. However, the efficiency of runahead execution, which directly affects the dynamic energy consumed by a runahead processor, has not been explored. A runahead processor executes significantly more instructions than a traditional out-of-order processor, sometimes without providing any performance benefit, which makes it inefficient. In this paper, we describe the causes of inefficiency in runahead execution and propose techniques to make a runahead processor more efficient, thereby reducing its energy consumption and possibly increasing its performance. Our analyses and results provide two major insights: (1) the efficiency of runahead execution can be greatly improved with simple techniques that reduce the number of short, overlapping, and useless runahead periods, which we identify as the three major causes of inefficiency; (2) simple optimizations targeting the increase of useful prefetches generated in runahead mode can increase both the performance and efficiency of a runahead processor. The techniques we propose reduce the increase in the number of instructions executed due to runahead execution from 26.5% to 6.2%, on average, without significantly affecting the performance improvement provided by runahead execution.

[1]  Peter J. Denning,et al.  Working Sets Past and Present , 1980, IEEE Transactions on Software Engineering.

[2]  Trevor Mudge,et al.  Improving data cache performance by pre-executing instructions under a cache miss , 1997 .

[3]  Lockup-free instruction fetch/prefetch cache organization , 1981, ISCA '98.

[4]  Dirk Grunwald,et al.  Pipeline gating: speculation control for energy reduction , 1998, ISCA.

[5]  Andrew F. Glew MLP yes! ILP no , 1998, ASPLOS 1998.

[6]  Simultaneous subordinate microthreading (SSMT) , 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).

[7]  John Paul Shen,et al.  Dynamic speculative precomputation , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[8]  A. J. KleinOsowski,et al.  MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research , 2002, IEEE Computer Architecture Letters.

[9]  Balaram Sinharoy,et al.  POWER4 system microarchitecture , 2002, IBM J. Res. Dev..

[10]  Onur Mutlu,et al.  Runahead Execution: An Effective Alternative to Large Instruction Windows , 2003, IEEE Micro.

[11]  R. D. Valentine,et al.  The Intel Pentium M processor: Microarchitecture and performance , 2003 .

[12]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[13]  Brian Fahs,et al.  Microarchitecture optimizations for exploiting memory-level parallelism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[14]  Santosh G. Abraham,et al.  Effective stream-based and execution-based data prefetching , 2004, ICS '04.

[15]  Onur Mutlu,et al.  On Reusing the Results of Pre-Executed Instructions in a Runahead Execution Processor , 2005, IEEE Computer Architecture Letters.