Orinoco: Ordered Issue and Unordered Commit with Non-Collapsible Queues

Modern out-of-order processors call for more aggressive scheduling techniques, such as priority scheduling and out-of-order commit, to make use of increasing core resources. Because these approaches prioritize the issue or commit of certain instructions, they face the conundrum of keeping scheduling structures capacity-efficient while preserving the ideal ordering of instructions. Traditional collapsible queues are too expensive for today's processors, while state-of-the-art queue designs settle for a pseudo-ordering of instructions, leading to performance degradation as well as other limitations. In this paper, we present Orinoco, a microarchitecture/circuit co-design that supports ordered issue and unordered commit with non-collapsible queues. We decouple the temporal ordering of instructions from their queue positions by introducing an age matrix with bit count encoding, along with a commit dependency matrix and a memory disambiguation matrix, to determine which instructions to prioritize for issue or commit. We leverage the Processing-in-Memory (PIM) approach to implement the matrix schedulers efficiently as 8T SRAM arrays. Orinoco achieves an average IPC improvement of 14.8% over the baseline in-order commit core with the state-of-the-art scheduler, while incurring overhead equivalent to a few kilobytes of SRAM.
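
To make the idea of decoupling age from queue position concrete, the following is a minimal software sketch of a generic age matrix over a non-collapsible queue. It is not the Orinoco circuit. The class name `AgeMatrix`, its methods, and the selection rule are hypothetical, and the use of a per-row bit count to rank ready instructions is only one plausible reading of "bit count encoding", assumed here for illustration.

```python
class AgeMatrix:
    """Illustrative age matrix; names and policy are assumptions, not Orinoco's design."""

    def __init__(self, size):
        self.size = size
        self.valid = 0               # bit i set => slot i holds an instruction
        self.older = [0] * size      # older[i] bit j set => slot j is older than slot i

    def allocate(self, slot):
        # Any free slot may be used: every currently valid entry is older.
        assert not (self.valid >> slot) & 1, "slot already occupied"
        self.older[slot] = self.valid
        self.valid |= 1 << slot

    def free(self, slot):
        # Clear the row and the column; no other entry moves (non-collapsible).
        mask = ~(1 << slot)
        self.valid &= mask
        self.older[slot] = 0
        for i in range(self.size):
            self.older[i] &= mask

    def select_oldest(self, ready, width):
        # Assumed rule: a ready entry's rank is the bit count of its row masked
        # by the ready vector, i.e. the number of older ready entries.
        ready &= self.valid
        picks = []
        for i in range(self.size):
            if (ready >> i) & 1:
                rank = bin(self.older[i] & ready).count("1")
                if rank < width:
                    picks.append((rank, i))
        return [slot for _, slot in sorted(picks)]


q = AgeMatrix(4)
for slot in (2, 0, 3):       # program order: slot 2, then slot 0, then slot 3
    q.allocate(slot)
q.free(2)                    # oldest instruction leaves; slots 0 and 3 stay put
q.allocate(1)                # youngest instruction reuses a low-numbered slot
print(q.select_oldest(0b1011, 2))
```

In this sketch the final call returns slots 0 and 3 in age order, even though slot 1 sits at a lower position than slot 3, because age is read from the matrix rather than from where an entry happens to sit in the queue.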
