A Case for Low-Complexity Multi-CMP Architectures

Plentiful research has addressed low-complexity software-based shared-memory systems since the idea was first introduced more than two decades ago. However, software-coherent systems have not been very successful in the commercial marketplace. We believe there are two main reasons for this: lack of performance and/or lack of binary compatibility. This thesis studies multiple aspects of how to design future binary-compatible high-performance scalable shared-memory servers while keeping the hardware complexity at a minimum. It starts with a software-based distributed shared-memory system relying on no specific hardware support and gradually moves towards architectures with simple hardware support. The evaluation is made in a modern chip-multiprocessor environment with both high-performance compute workloads and commercial applications. It shows that implementing the coherence-violation detection in hardware while solving the interchip coherence in software allows for high-performing binary-compatible systems with very low hardware complexity. Our second-generation hardware-software hybrid performs on par with, and often better than, traditional hardware-only designs. Based on our results, we conclude that it is not only possible to design simple systems while maintaining performance and the binary-compatibility envelope, it is often possible to get better performance than in traditional and more complex designs. We also explore two new techniques for evaluating a new shared-memory design throughout this work: adjustable simulation fidelity and statistical multiprocessor cache modeling.

[1]  Margaret Martonosi,et al.  Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors , 1996, ISCA.

[2]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[3]  T. Brewer,et al.  The evolution of the HP/Convex Exemplar , 1997, Proceedings IEEE COMPCON 97. Digest of Papers.

[4]  Erik Hagersten,et al.  Simple COMA node implementations , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[5]  David L Weaver,et al.  The SPARC architecture manual : version 9 , 1994 .

[6]  A. Gupta,et al.  The Stanford FLASH multiprocessor , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[7]  Tom Lovett,et al.  STiNG: A CC-NUMA Computer System for the Commercial Marketplace , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[8]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[9]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[10]  Michel Dubois,et al.  Sequential Hardware Prefetching in Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..

[11]  Balaram Sinharoy,et al.  POWER4 system microarchitecture , 2002, IBM J. Res. Dev..

[12]  Erik Hagersten,et al.  VASA: A Simulator Infrastructure with Adjustable Fidelity , 2005, IASTED PDCS.

[13]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[14]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[15]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[16]  J. Larus,et al.  Tempest and Typhoon: user-level shared memory , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[17]  Ravi Rajwar,et al.  Speculative lock elision: enabling highly concurrent multithreaded execution , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[18]  Kourosh Gharachorloo,et al.  Efficient ECC-Based Directory Implementations for Scalable Multiprocessors , 2000 .

[19]  Erik Hagersten,et al.  TMA: a trap-based memory architecture , 2006, ICS '06.

[20]  Mainak Chaudhuri,et al.  SMTp: an architecture for next-generation scalable multi-threading , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[21]  John L. Hennessy,et al.  An evaluation of a commercial CC-NUMA architecture-the CONVEX Exemplar SPP1200 , 1997, Proceedings 11th International Parallel Processing Symposium.

[22]  David A. Koufaty,et al.  Hyperthreading Technology in the Netburst Microarchitecture , 2003, IEEE Micro.

[23]  James R. Larus,et al.  Fine-grain access control for distributed shared memory , 1994, ASPLOS VI.

[24]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[25]  Anoop Gupta,et al.  The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[26]  Michael C. Browne,et al.  The S3.mp scalable shared memory multiprocessor , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[27]  David A. Wood,et al.  Decoupled Hardware Support for Distributed Shared Memory , 1996, ISCA.