Study of an In-order SMT Architecture and Grouping Schemes

In this paper, we propose a simultaneous multithreading (SMT) architecture that im- proves instruction throughput by exploiting instruction level parallelism (ILP) and thread level parallelism (TLP). The proposed architecture issues and completes instructions belonging to the same thread in exact program order. The issue and completion policy greatly reduces the design complexity and hardware cost of our architecture, compared with others that employ out-of- order issue and completion. On the other hand, when the instructions belong to different threads, the issue and completion orders for those instructions may not necessarily be identical to the fetch order. The processor issues instructions simultaneously from multiple threads to functional units by exploiting ILP and TLP, and by dynamic resource sharing. That parallel execution no- tably improves performance and resource utilization with minimal additional hardware cost over the conventional superscalar processors. This paper proposes an SMT architecture with grouping as well as one without grouping. Without grouping, all threads dynamically and flexibly share most resources. On the other hand, in the SMT architecture with grouping, in which resources and threads are divided into several groups for design simplification, resources are shared only among threads belonging to the same group as those resources. Simulation results show that our processors with four and eight threads improve performance by three or more times over the conventional superscalar processor with comparable execution resources and policies, and that reasonable grouping reduces the design complexity of SMT processors with little negative effect on performance.

[1]  Gurindar S. Sohi High-Bandwidth Interleaved Memories for Vector Processors-A Simulation Study , 1993, IEEE Trans. Computers.

[2]  Kozo Kimura,et al.  An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[3]  Sang-Hui Park,et al.  A Rotation Invariant Image Retrieval with Local Features , 2003 .

[4]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[5]  Mario Nemirovsky,et al.  Increasing superscalar performance through multistreaming , 1995, PACT.

[6]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[7]  Jack L. Lo,et al.  Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[8]  Mike Johnson,et al.  Superscalar microprocessor design , 1991, Prentice Hall series in innovative technology.

[9]  Nader Bagherzadeh,et al.  Special features of a VLIW architecture , 1991, [1991] Proceedings. The Fifth International Parallel Processing Symposium.

[10]  Chuan-lin Wu,et al.  A Benchmark Evaluation of a Multi-threaded RISC Processor Architecture , 1991, ICPP.

[11]  John L. Henning SPEC CPU2000: Measuring CPU Performance in the New Millennium , 2000, Computer.