The impact of resource partitioning on SMT processors

Simultaneous multithreading (SMT) increases processor throughput by multiplexing resources among several threads. Despite the commercial availability of SMT processors, several aspects of this resource sharing are not well understood. For example, academic SMT studies typically assume that resources are shared dynamically, while industrial designs tend to divide resources statically among threads. This study seeks to quantify the performance impact of resource partitioning policies in SMT machines, focusing on the execution portion of the pipeline. We find that for storage resources, such as the instruction queue and reorder buffer, statically allocating an equal portion to each thread provides good performance, in part by avoiding starvation. The enforced fairness provided by this partitioning obviates sophisticated fetch policies to a large extent. SMT's potential ability to allocate storage resources dynamically across threads does not appear to be of significant benefit. In contrast, static division of issue bandwidth has a negative impact on throughput. SMT's ability to multiplex bursty execution streams dynamically onto shared function units contributes to its overall throughput. Finally, we apply these insights to SMT support in clustered architectures. Assigning threads to separate clusters eliminates intercluster communication; however, in some circumstances, the resulting partitioning of issue bandwidth cancels out the performance benefit of eliminating communication.

[1]  Yiannakis Sazeides,et al.  How to compare the performance of two SMT microarchitectures , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[2]  J. E. Thornton,et al.  Parallel operation in the control data 6600 , 1964, AFIPS '64 (Fall, part II).

[3]  Mario Nemirovsky,et al.  Increasing superscalar performance through multistreaming , 1995, PACT.

[4]  G. Edward Suh,et al.  A new memory monitoring scheme for memory-aware scheduling and partitioning , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[5]  Dean M. Tullsen,et al.  Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[6]  Manoj Franklin,et al.  PEWs: a decentralized dynamic scheduler for ILP processing , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[7]  P. Gronowski,et al.  Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading , 2002, 2002 IEEE International Solid-State Circuits Conference. Digest of Technical Papers (Cat. No.02CH37315).

[8]  William J. Miceli,et al.  Real Time Signal Processing , 1985 .

[9]  Ramon Canal,et al.  Dynamic cluster assignment mechanisms , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[10]  Norman P. Jouppi,et al.  The multicluster architecture: reducing cycle time through partitioning , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[11]  D. Marr,et al.  Hyper-Threading Technology Architecture and MIcroarchitecture , 2002 .

[12]  James E. Smith,et al.  Complexity-Effective Superscalar Processors , 1997, ISCA.

[13]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[14]  Donald Yeung,et al.  Transparent threads: resource sharing in SMT processors for high single-thread performance , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[15]  Mario Nemirovsky,et al.  DISC: dynamic instruction stream computer , 1991, MICRO 24.

[16]  Hwa C. Torng,et al.  The Concurrent Execution of Multiple Instruction Streams on Superscalar Processors , 1991, ICPP.

[17]  James E. Smith,et al.  An instruction set and microarchitecture for instruction level distributed processing , 2002, ISCA.

[18]  Lori Pollock,et al.  An experimental study of several cooperative register allocation and instruction scheduling strategies , 1995, MICRO 1995.

[19]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[20]  R. Rosner Computer software , 1978, Nature.

[21]  Burton J. Smith Architecture And Applications Of The HEP Multiprocessor Computer System , 1982, Optics & Photonics.

[22]  Dean M. Tullsen,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[23]  Anoop Gupta,et al.  Interleaving: a multithreading technique targeting multiprocessors and workstations , 1994, ASPLOS VI.

[24]  Lieven Eeckhout,et al.  Quantifying the Impact of Input Data Sets on Program Behavior and its Applications , 2003, J. Instr. Level Parallelism.

[25]  Kozo Kimura,et al.  An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[26]  Dean M. Tullsen,et al.  Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading , 1997, TOCS.

[27]  James E. Thomton,et al.  Parallel Operation in the Control Data 6600 , 1899 .

[28]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[29]  Dean M. Tullsen,et al.  Handling long-latency loads in a simultaneous multithreading processor , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[30]  Nathan L. Binkert,et al.  Network-Oriented Full-System Simulation using M5 , 2003 .

[31]  Anant Agarwal,et al.  APRIL: a processor architecture for multiprocessing , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[32]  Allan Porterfield,et al.  The Tera computer system , 1990 .

[33]  Doug Burger,et al.  Evaluating Future Microprocessors: the SimpleScalar Tool Set , 1996 .

[34]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[35]  M TullsenDean,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000 .