Long term parking (LTP): Criticality-aware resource allocation in OOO processors

Modern processors employ large structures (IQ, LSQ, register file, etc.) to expose instruction-level parallelism (ILP) and memory-level parallelism (MLP). These resources are typically allocated to instructions in program order. This wastes resources by allocating resources to instructions that are not yet ready to be executed and by eagerly allocating resources to instructions that are not part of the application's critical path. This work explores the possibility of allocating pipeline resources only when needed to expose MLP, and thereby enabling a processor design with significantly smaller structures, without sacrificing performance. First we identify the classes of instructions that should not reserve resources in program order and evaluate the potential performance gains we could achieve by delaying their allocations. We then use this information to "park" such instructions in a simpler, and therefore more efficient, Long Term Parking (LTP) structure. The LTP stores instructions until they are ready to execute, without allocating pipeline resources, and thereby keeps the pipeline available for instructions that can generate further MLP. LTP can accurately and rapidly identify which instructions to park, park them before they execute, wake them when needed to preserve performance, and do so using a simple queue instead of a complex IQ. We show that even a very simple queue-based LTP design allows us to significantly reduce IQ (64 → 32) and register file (128 → 96) sizes while retaining MLP performance and improving energy efficiency.

[1]  T. Austin,et al.  Cyclone: a broadcast-free dynamic instruction scheduler with selective replay , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[2]  Alvin M. Despain,et al.  The 16-fold way: a microparallel taxonomy , 1993, MICRO 1993.

[3]  Norman P. Jouppi,et al.  CACTI 6.0: A Tool to Model Large Caches , 2009 .

[4]  Enric Morancho,et al.  Recovery mechanism for latency misprediction , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[5]  R. Iris Bahar,et al.  The non-critical buffer: using load latency tolerance to improve data cache efficiency , 1999, Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040).

[6]  Sanjay J. Patel,et al.  Instruction fetch deferral using static slack , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[7]  Lieven Eeckhout,et al.  The Load Slice Core microarchitecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[8]  Dirk Grunwald,et al.  Predictive sequential associative cache , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[9]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[10]  Mateo Valero,et al.  Toward kilo-instruction processors , 2004, TACO.

[11]  Brad Calder,et al.  Dynamic prediction of critical path instructions , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[12]  Huiyang Zhou,et al.  Dual-core execution: building a highly scalable single-thread instruction window , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[13]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[14]  Stamatis Vassiliadis,et al.  Register renaming and dynamic speculation: an alternative approach , 1993, MICRO.

[15]  Mateo Valero,et al.  Delaying physical register allocation through virtual-physical registers , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[16]  Amir Roth,et al.  BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[17]  Simha Sethumadhavan,et al.  Late-binding: enabling unordered load-store queues , 2007, ISCA '07.

[18]  Masahiro Goshima,et al.  A high-speed dynamic instruction scheduling scheme for superscalar processors , 2001, MICRO.

[19]  Larry L. Biro,et al.  Power considerations in the design of the Alpha 21264 microprocessor , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).

[20]  James E. Smith,et al.  Complexity-Effective Superscalar Processors , 1997, ISCA.

[21]  Rastislav Bodík,et al.  Focusing processor policies via critical-path prediction , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[22]  Enric Morancho,et al.  On reducing energy-consumption by late-inserting instructions into the issue queue , 2007, Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07).

[23]  Tong Li,et al.  A large, fast instruction window for tolerating cache misses , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[24]  Haitham Akkary,et al.  Scalable load and store processing in latency tolerant processors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[25]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26]  Hideki Ando,et al.  MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[27]  Amir Roth,et al.  Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors , 2009, ISCA '09.

[28]  Trevor Mudge,et al.  Improving data cache performance by pre-executing instructions under a cache miss , 1997 .

[29]  Haitham Akkary,et al.  Continual flow pipelines , 2004, ASPLOS XI.

[30]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[31]  Nader Bagherzadeh,et al.  A scalable register file architecture for dynamically scheduled processors , 1996, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique.