A controlled fetching technique for effective management of shared resources in SMT processors

Abstract: Simultaneous Multi-Threading (SMT) is a processor design technique that supports concurrent execution of instructions from multiple threads in every cycle by sharing the key datapath components. In the SMT architecture, the shared resources normally include the physical register file, the Issue Queue (IQ), the functional units, the write buffer and the cache memory. Efficient utilization of these shared resources is critical to achieving a high performance gain. The physical rename register file is one of the most critical shared resources in the SMT architecture because it sits at the front of the pipeline. Because threads share the physical registers, an SMT processor needs fewer registers than multiple superscalar processors would need to achieve a similar throughput. However, sharing also means that excessive occupancy of the physical register file by slower threads can leave too few registers for the other threads in the system and thus degrade overall performance. In this paper, we propose an intelligent fetching algorithm for efficient management of the shared physical register file. Although the primary focus is on managing the physical register file effectively, the algorithm also indirectly controls the other shared resources downstream in the pipeline. The main goal is a simple resource management scheme that achieves a considerable performance gain without incurring substantial processing or hardware overhead in a practical implementation and without requiring modifications to the other pipeline stages. We demonstrate that temporarily suspending the slow threads at the fetch stage can improve overall system performance by a significant margin: by up to 63% on a 4-threaded system and by up to 68% on an 8-threaded system. The throughput of an 8-threaded system with 320 register file entries is significantly higher than that of the default system with 416 register entries, indicating a resource saving of 60%.
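The abstract does not state the criterion used to decide which threads are suspended, so the following is only a minimal sketch of the general idea of fetch-stage gating driven by register file occupancy. Every name and parameter here (`ThreadState`, `update_fetch_gating`, `cap_fraction`, `resume_fraction`) is hypothetical, and the two-threshold rule is an assumption chosen for illustration rather than the algorithm proposed in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThreadState:
    tid: int
    regs_held: int           # physical rename registers currently allocated
    suspended: bool = False  # True while the thread is blocked from fetching


def update_fetch_gating(
    threads: List[ThreadState],
    total_regs: int,
    cap_fraction: float = 0.5,      # assumed suspension threshold
    resume_fraction: float = 0.25,  # assumed resumption threshold
) -> List[int]:
    """Return the ids of threads allowed to fetch this cycle.

    Assumed policy: a thread that holds more than cap_fraction of the
    register file is suspended at fetch, and it is resumed only once its
    occupancy falls below resume_fraction. The gap between the two
    thresholds keeps a thread from toggling every cycle.
    """
    cap = cap_fraction * total_regs
    resume = resume_fraction * total_regs
    for t in threads:
        if not t.suspended and t.regs_held > cap:
            t.suspended = True
        elif t.suspended and t.regs_held < resume:
            t.suspended = False
    return [t.tid for t in threads if not t.suspended]
```

A simulator would call `update_fetch_gating` once per cycle, before the fetch stage selects a thread, and would update `regs_held` as registers are allocated at rename and freed at commit. Because the decision is taken entirely at fetch, a scheme of this shape leaves the later pipeline stages untouched, which matches the low-overhead goal stated in the abstract.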
