Filtered runahead execution with a runahead buffer

Runahead execution dynamically expands the instruction window of an out-of-order processor to generate memory-level parallelism (MLP) while the core would otherwise be stalled. However, runahead requires the front-end to remain active to supply instructions. We propose a new structure, the runahead buffer, for supplying these instructions. We observe that cache misses are often caused by short, repetitive dependence chains, and we store these chains in the runahead buffer. During runahead, the buffer supplies instructions in place of the front-end. This generates 2× more MLP than traditional runahead on average because the core can run further ahead. It also saves energy, since the front-end can be clock-gated, reducing dynamic energy consumption. Over a no-prefetching/prefetching baseline, the runahead buffer yields a performance benefit of 17.2%/7.8% and an energy reduction of 6.7%/4.5%, respectively. By comparison, traditional runahead with additional energy optimizations yields a performance benefit of 12.1%/5.9% but an energy increase of 9.5%/5.4%. Finally, we propose a hybrid policy that switches between the runahead buffer and traditional runahead to maximize performance.
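The idea of capturing a miss's dependence chain and replaying it can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the hardware mechanism evaluated in the paper: the `Uop` record, the register names, and the functions `extract_chain` and `supply_from_buffer` are all hypothetical. It walks backward from a miss-causing load through a window of decoded operations, keeps only the operations that produce the load's source registers, and then replays that filtered chain in a loop, as a buffer that stands in for the front-end might.

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Uop:
    """Hypothetical decoded operation: program counter, opcode, registers."""
    pc: int
    op: str
    dst: Optional[str]
    srcs: Tuple[str, ...]


def extract_chain(window: List[Uop], miss_index: int) -> List[Uop]:
    """Backward dataflow walk: keep only uops that feed the missing load."""
    miss = window[miss_index]
    needed = set(miss.srcs)          # registers whose producers we still need
    chain = [miss]
    for uop in reversed(window[:miss_index]):
        if uop.dst is not None and uop.dst in needed:
            needed.discard(uop.dst)  # producer found
            needed.update(uop.srcs)  # now its own sources are needed
            chain.append(uop)
    chain.reverse()                  # restore program order
    return chain


def supply_from_buffer(chain: List[Uop], n_uops: int) -> Iterator[Uop]:
    """Replay the stored chain repeatedly, yielding n_uops operations."""
    return itertools.islice(itertools.cycle(chain), n_uops)


# Example window: a pointer-chasing pattern with two unrelated operations.
window = [
    Uop(0x10, "add", "r1", ("r1",)),        # advance pointer
    Uop(0x14, "mul", "r5", ("r6", "r7")),   # unrelated to the miss
    Uop(0x18, "load", "r2", ("r1",)),       # feeds the missing load
    Uop(0x1C, "add", "r9", ("r9",)),        # unrelated to the miss
    Uop(0x20, "load", "r3", ("r2",)),       # load that misses in the cache
]

chain = extract_chain(window, miss_index=4)
replayed = list(supply_from_buffer(chain, 5))
```

In this example the two unrelated operations are filtered out, so the replayed stream contains only the three operations needed to generate the next miss address. A real design would also have to handle memory dependences, branches, and chain-length limits, none of which this sketch models.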
