Runahead execution: an alternative to very large instruction windows for out-of-order processors

Today's high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large in terms of both design complexity and power consumption. And, the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel/spl reg/ Pentium/spl reg/ processor, having a 128-entry instruction window, adding runahead execution improves the IPC (instructions per cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.

[1]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[2]  John Paul Shen,et al.  Dynamic speculative precomputation , 2001, MICRO.

[3]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[4]  David Kroft,et al.  Lockup-free instruction fetch/prefetch cache organization , 1998, ISCA '81.

[5]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[6]  Joel S. Emer,et al.  Memory dependence prediction using store sets , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[7]  José González,et al.  Dual path instruction processing , 2002, ICS '02.

[8]  Trevor N. Mudge,et al.  Author retrospective improving data cache performance by pre-executing instructions under a cache miss , 1997, International Conference on Supercomputing.

[9]  Eric Sprangle,et al.  Increasing processor performance by implementing deeper pipelines , 2002, ISCA.

[10]  Trevor Mudge,et al.  Improving processor performance by dynamically pre-processing the instruction stream , 1998 .

[11]  Ken Kennedy,et al.  Software prefetching , 1991, ASPLOS IV.

[12]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[13]  Eric Rotenberg,et al.  A large, fast instruction window for tolerating cache misses , 2002, ISCA.

[14]  Andreas Moshovos,et al.  Streamlining inter-operation memory communication via data dependence prediction , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[15]  Dirk Grunwald,et al.  Content-sensitive data prefetching , 2002 .

[16]  Craig Zilles,et al.  Execution-based prediction using speculative slices , 2001, ISCA 2001.

[17]  Kenneth C. Yeager The Mips R10000 superscalar microprocessor , 1996, IEEE Micro.

[18]  Todd C. Mowry,et al.  Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[19]  Chi-Keung Luk,et al.  Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[20]  Daniel A. Jiménez,et al.  Dynamic branch prediction with perceptrons , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[21]  SprangleEric,et al.  Increasing processor performance by implementing deeper pipelines , 2002 .

[22]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[23]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[24]  Maurice V. Wilkes,et al.  Slave Memories and Dynamic Storage Allocation , 1965, IEEE Trans. Electron. Comput..

[25]  Rajeev Balasubramonian,et al.  Dynamically allocating processor resources between nearby and distant ILP , 2001, ISCA 2001.

[26]  Yale N. Patt,et al.  Simultaneous subordinate microthreading (SSMT) , 1999, ISCA.