论文信息 - Two-Level Reorder Buffers: Accelerating Memory-Bound Applications on SMT Architectures

Two-Level Reorder Buffers: Accelerating Memory-Bound Applications on SMT Architectures

We propose a low complexity mechanism for accelerating memory-bound threads on SMT processors without adversely impacting the performance of other concurrently running applications. The main idea is to provide a two-level organization of the Reorder Buffer (ROB), where the first level is comprised of small private per-thread ROBs which are used in the normal course of execution in the absence of last level cache misses. The second ROB level is a much larger storage that can be used on demand by threads experiencing last level cache misses. The key feature of our scheme is that the allocation of the second-level ROB partition occurs to a thread experiencing a miss into the last level cache only if the number of instructions dependent on the missing load is below a predetermined threshold. We introduce a novel low-complexity mechanism to count the number of load-dependent instructions and propose two schemes for allocating second level ROB: predictive and reactive. Our results demonstrate about 30% improvement over DCRA resource distribution mechanism in terms of "harmonic mean of weighted IPCs" metric.

Dmitry V. Ponomarev | Jason Loew

[1] Dean M. Tullsen,et al. Handling long-latency loads in a simultaneous multithreading processor , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[2] Gürhan Küçük,et al. Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources , 2001, MICRO.

[3] Joseph J. Sharkey,et al. Adaptive reorder buffers for SMT processors , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[4] Donald Yeung,et al. Transparent threads: resource sharing in SMT processors for high single-thread performance , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[5] Manoj Franklin,et al. Balancing thoughput and fairness in SMT processors , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[6] Haitham Akkary,et al. Continual flow pipelines , 2004, ASPLOS XI.

[7] Jack L. Lo,et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[8] Todd M. Austin,et al. The SimpleScalar tool set, version 2.0 , 1997, CARN.

[9] Wei Liu,et al. ReSlice: selective re-execution of long-retired misspeculated instructions using forward slicing , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[10] D. Marr,et al. Hyper-Threading Technology Architecture and MIcroarchitecture , 2002 .

[11] Kanad Ghose,et al. Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[12] David M. Brooks,et al. A circuit level implementation of an adaptive issue queue for power-aware microprocessors , 2001, GLSVLSI '01.

[13] David H. Albonesi,et al. Front-end policies for improved issue efficiency in SMT processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[14] James E. Smith,et al. Complexity-Effective Superscalar Processors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[15] Dean M. Tullsen,et al. Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[16] Francisco J. Cazorla,et al. Dynamically Controlled Resource Allocation in SMT Processors , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[17] Nasser Yazdani,et al. Thread-Sensitive Instruction Issue for SMT Processors , 2004, IEEE Computer Architecture Letters.

[18] Mikko H. Lipasti,et al. Understanding scheduling replay schemes , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[19] Joseph J. Sharkey,et al. An L2-miss-driven early register deallocation for SMT processors , 2007, ICS '07.

[20] Yale N. Patt,et al. On pipelining dynamic instruction scheduling logic , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[21] Steven K. Reinhardt,et al. The impact of resource partitioning on SMT processors , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[22] John L. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium , 2000, Computer.

[23] Brad Calder,et al. Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[24] Stijn Eyerman,et al. A Memory-Level Parallelism Aware Fetch Policy for SMT Processors , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[25] David A. Koufaty,et al. Hyperthreading Technology in the Netburst Microarchitecture , 2003, IEEE Micro.

[26] Haitham Akkary,et al. Continual flow pipelines: achieving resource-efficient latency tolerance , 2004, IEEE Micro.

[27] Francisco J. Cazorla,et al. Improving Memory Latency Aware Fetch Policies for SMT Processors , 2003, ISHPC.