Continuous runahead: Transparent hardware acceleration for memory intensive workloads

Runahead execution pre-executes the application's own code to generate new cache misses. This pre-execution results in prefetch requests that are overwhelmingly accurate (95% in a realistic system configuration for the memory-intensive SPEC CPU2006 benchmarks), much more so than a global history buffer (GHB) or stream prefetcher (by 13% and 19%, respectively). However, we also find that current runahead techniques are very limited in coverage: they prefetch only a small fraction (13%) of all runahead-reachable cache misses. This is because runahead intervals are short, limited by the duration of each full-window stall. In this work, we explore removing the constraints that lead to these short intervals. We dynamically filter the instruction stream to identify the chains of operations that cause the pipeline to stall. These operations are renamed to execute speculatively in a loop and are then migrated to a Continuous Runahead Engine (CRE), a shared multi-core accelerator located at the memory controller. The CRE runs ahead with the chain continuously, increasing prefetch coverage to 70% of runahead-reachable cache misses. The result is a 43.3% weighted speedup gain on a set of memory-intensive quad-core workloads and a significant reduction in system energy consumption. This is a 21.9% performance gain over the Runahead Buffer, a state-of-the-art runahead proposal, and a 13.2% and 13.5% gain over GHB and stream prefetching, respectively. When the CRE is combined with GHB prefetching, we observe a 23.5% gain over a baseline with GHB prefetching alone.
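The metrics quoted above (prefetch accuracy, prefetch coverage, and weighted speedup) can be illustrated with a minimal sketch. The abstract does not define them, so the formulas below are assumed from common usage in the prefetching and multiprogram-evaluation literature, and all function names, parameter names, and input numbers are hypothetical.

```python
# Illustrative sketch only: the abstract does not give these formulas.
# The definitions follow common usage and are assumptions, not the
# authors' stated methodology.

def prefetch_accuracy(useful_prefetches, issued_prefetches):
    """Fraction of issued prefetches that were later used by a demand access."""
    if issued_prefetches == 0:
        return 0.0
    return useful_prefetches / issued_prefetches


def prefetch_coverage(prefetched_misses, reachable_misses):
    """Fraction of runahead-reachable misses that were covered by a prefetch."""
    if reachable_misses == 0:
        return 0.0
    return prefetched_misses / reachable_misses


def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over cores of IPC when co-running divided by IPC when running alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per core")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def percent_gain(new_value, baseline_value):
    """Relative improvement of new_value over baseline_value, in percent."""
    return (new_value / baseline_value - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical counters, chosen only to exercise the functions.
    print(prefetch_accuracy(950, 1000))
    print(prefetch_coverage(130, 1000))
    base = weighted_speedup([0.5, 0.4, 0.6, 0.5], [1.0, 1.0, 1.0, 1.0])
    improved = weighted_speedup([0.7, 0.6, 0.8, 0.7], [1.0, 1.0, 1.0, 1.0])
    print(percent_gain(improved, base))
```

Under these assumed definitions, a percentage gain in weighted speedup is the relative change of the summed per-core ratios between two system configurations.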
