SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both high performance and high energy efficiency. In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach hides memory latency and attains increased memory- and instruction-level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches, we do not rely on either software or hardware speculation, which can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, enabling it to transform applications that include highly complex control flow and indirect memory accesses.
