Two-Level Reorder Buffers: Accelerating Memory-Bound Applications on SMT Architectures

We propose a low-complexity mechanism for accelerating memory-bound threads on SMT processors without adversely impacting the performance of other concurrently running applications. The main idea is a two-level organization of the Reorder Buffer (ROB). The first level consists of small, private per-thread ROBs that are used in the normal course of execution, in the absence of last-level cache misses. The second level is a much larger storage that threads experiencing last-level cache misses can use on demand. The key feature of our scheme is that the second-level ROB partition is allocated to a thread experiencing a last-level cache miss only if the number of instructions dependent on the missing load is below a predetermined threshold. We introduce a novel low-complexity mechanism to count the number of load-dependent instructions and propose two schemes for allocating the second-level ROB: predictive and reactive. Our results demonstrate about a 30% improvement over the DCRA resource distribution mechanism in terms of the "harmonic mean of weighted IPCs" metric.
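The allocation rule described above can be illustrated with a minimal Python sketch. It is not the paper's hardware design: the class name, the sizes, the threshold value, the single-owner policy, and the release-on-miss-return behavior are all assumptions made for illustration. The dependent-instruction count is passed in as a plain number, and neither the counting mechanism nor the predictive and reactive schemes are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TwoLevelROB:
    """Toy model of the threshold-gated second-level ROB allocation.

    All fields are hypothetical parameters chosen for illustration.
    """
    first_level_size: int        # entries in each private per-thread ROB
    second_level_size: int       # entries in the shared second-level ROB
    dependent_threshold: int     # maximum tolerated load-dependent instructions
    second_level_owner: Optional[int] = None  # thread holding the second level

    def capacity(self, thread_id: int) -> int:
        """ROB entries currently available to the given thread."""
        extra = self.second_level_size if self.second_level_owner == thread_id else 0
        return self.first_level_size + extra

    def on_llc_miss(self, thread_id: int, dependent_count: int) -> bool:
        """Try to grant the second level to a thread with a last-level cache miss.

        Grants only when the count of instructions dependent on the missing
        load is strictly below the threshold. Assumption: the second level
        has a single owner at a time.
        """
        if self.second_level_owner is not None:
            return False
        if dependent_count >= self.dependent_threshold:
            return False
        self.second_level_owner = thread_id
        return True

    def on_miss_serviced(self, thread_id: int) -> None:
        """Release the second level when the owner's miss returns (assumption)."""
        if self.second_level_owner == thread_id:
            self.second_level_owner = None


rob = TwoLevelROB(first_level_size=32, second_level_size=256, dependent_threshold=8)
granted = rob.on_llc_miss(thread_id=0, dependent_count=3)
print(granted, rob.capacity(0), rob.capacity(1))
```

In this sketch a thread with few load-dependent instructions gains the extra window, since most of what it fetches past the miss can execute, while a thread whose fetched instructions mostly depend on the missing load is refused and stays within its private first-level ROB.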
