The performance impact of flexibility in the Stanford FLASH multiprocessor

A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debugging process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC leads to lower performance for the flexible design. In most cases, however, FLASH is only 2%–12% slower than the idealized machine.
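To make the notion of a "programmable protocol" concrete, the following is a minimal sketch, in C, of the kind of software handler a programmable node controller dispatches on each incoming transaction. It is a generic MSI-style directory read-miss handler, not FLASH's actual protocol code; the state names, dir_entry layout, and send_message stub are hypothetical illustrations.

/*
 * Sketch only: a simplified directory read-miss handler of the sort a
 * programmable node controller might run in software. All names and
 * the message-send stub are hypothetical, not MAGIC's real interface.
 */
#include <stdint.h>
#include <stdio.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE };

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;   /* one bit per node; assumes at most 64 nodes */
    int owner;          /* valid only in DIR_EXCLUSIVE */
};

/* Stub for the network interface; a real controller enqueues a packet. */
static void send_message(int dest, const char *msg, uint64_t addr)
{
    printf("-> node %d: %s (addr 0x%llx)\n",
           dest, msg, (unsigned long long)addr);
}

/* Handle a read miss from node `requester` on cache line `addr`. */
static void handle_read_miss(struct dir_entry *e, int requester, uint64_t addr)
{
    switch (e->state) {
    case DIR_UNCACHED:
        /* No cached copies: memory supplies the data. */
        e->state = DIR_SHARED;
        e->sharers = 1ULL << requester;
        send_message(requester, "DATA (from memory)", addr);
        break;
    case DIR_SHARED:
        /* Add the requester to the sharer set; memory is up to date. */
        e->sharers |= 1ULL << requester;
        send_message(requester, "DATA (from memory)", addr);
        break;
    case DIR_EXCLUSIVE:
        /* Owner holds the only copy: fetch it and downgrade to shared. */
        send_message(e->owner, "FETCH/DOWNGRADE", addr);
        e->sharers = (1ULL << e->owner) | (1ULL << requester);
        e->state = DIR_SHARED;
        send_message(requester, "DATA (forwarded from owner)", addr);
        break;
    }
}

int main(void)
{
    struct dir_entry line = { DIR_UNCACHED, 0, -1 };
    handle_read_miss(&line, 2, 0x1000);  /* first reader: data from memory */
    handle_read_miss(&line, 5, 0x1000);  /* second reader: line now shared */
    return 0;
}

Because every transaction dispatches through handlers of this shape, the occupancy of the controller, not just the raw network latency, becomes a first-order performance concern; this is exactly the effect the paper measures for applications with heavy remote-miss traffic or hot-spotting.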
