The need for fast communication in hardware-based speculative chip multiprocessors

Chip-multiprocessor (CMP) architectures are a promising design alternative for exploiting the ever-increasing number of transistors that can be put on a die. To deliver high performance on applications that cannot be easily parallelized, CMPs can use additional support for speculatively executing the possibly data-dependent threads of an application. While some of the cross-thread dependences in an application must be handled dynamically, others can be fully determined by the compiler. For the latter dependences, the threads can be made to synchronize and communicate either at the register level or at the memory level. In the past, it has been unclear whether the higher hardware cost of register-level communication is cost-effective. In this paper, we show that the wide-issue dynamic processors that will soon populate CMPs make fast communication a requirement for high performance. Consequently, we propose an effective hardware mechanism to support communication and synchronization of registers between on-chip processors. Our scheme adds enough support to enable register-level communication without specializing the architecture toward speculation so heavily that much of the hardware would sit unused under workloads that do not need speculative parallelization. Finally, the scheme allows the system to achieve near-ideal performance.
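
The abstract contrasts memory-level with register-level communication of compiler-identified cross-thread dependences. As a rough software analogy only (the paper proposes a hardware mechanism, and none of the code below is taken from it), the sketch models a one-entry "register channel" with a full/empty flag: the producer thread forwards its last write of a register value, and the consumer stalls until that value arrives. The names reg_channel, reg_send, and reg_recv are hypothetical.

```c
/*
 * Software analogy of register-level communication between two
 * speculative threads: a one-entry channel guarded by a full/empty
 * flag.  This is an illustrative sketch, not the paper's design.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>

typedef struct {
    _Atomic int full;   /* full/empty bit: 0 = empty, 1 = full */
    int64_t     value;  /* the forwarded register value        */
} reg_channel;

static void reg_send(reg_channel *ch, int64_t v)
{
    ch->value = v;
    /* release ordering makes the value visible before the flag flips */
    atomic_store_explicit(&ch->full, 1, memory_order_release);
}

static int64_t reg_recv(reg_channel *ch)
{
    /* the consumer waits until the producer has written the value;
       in hardware this would be a pipeline stall, not a spin loop */
    while (!atomic_load_explicit(&ch->full, memory_order_acquire))
        ;
    return ch->value;
}

static reg_channel chan = { 0, 0 };

static void *producer(void *arg)
{
    (void)arg;
    int64_t r = 40 + 2;  /* last write to the register in this thread */
    reg_send(&chan, r);  /* forward it to the successor thread        */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    int64_t r = reg_recv(&chan);  /* blocks until the value arrives */
    printf("consumer received %lld\n", (long long)r);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

The point of the analogy is only the synchronization pattern: because the compiler knows which register carries the dependence, the consumer need wait on exactly one value rather than on a memory-level mechanism.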
