Comparative performance evaluation of cache-coherent NUMA and COMA architectures

Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform memory access (CC-NUMA) machines and cache-only memory architectures (COMA). Both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA machines, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line-sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model showing that relative performance is determined primarily by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results from simulation studies of eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications in which data accesses by different processors are finely interleaved in memory space and, in addition, capacity misses dominate over coherence misses. In other situations, for example where coherence misses dominate, COMA can actually perform worse than CC-NUMA because its hierarchical directories increase miss latencies. Finally, we propose a new architectural alternative, called COMA-F, that combines the advantages of both CC-NUMA and COMA.
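The two factors named in the qualitative model can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the paper's model: the `Machine` type, the `avg_miss_latency` function, and every latency and fraction are hypothetical values chosen only to show the direction of the effect. It assumes that coherence misses are always served remotely, that COMA serves most capacity misses from local memory, and that COMA's remote latency is higher because of its hierarchical directories.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Machine:
    """Hypothetical machine parameters (illustrative only, not measured)."""
    local_latency: float        # cycles for a miss served by local memory
    remote_latency: float       # cycles for a miss served by a remote node
    local_capacity_frac: float  # share of capacity misses served locally


def avg_miss_latency(m: Machine, capacity_share: float) -> float:
    """Average miss latency when `capacity_share` of misses are capacity
    misses and the rest are coherence misses (assumed always remote)."""
    coherence_share = 1.0 - capacity_share
    capacity_cost = (m.local_capacity_frac * m.local_latency
                     + (1.0 - m.local_capacity_frac) * m.remote_latency)
    return capacity_share * capacity_cost + coherence_share * m.remote_latency


# Assumed numbers: finely interleaved data, so CC-NUMA page placement
# leaves most capacity misses remote; COMA pays more per remote miss.
CC_NUMA = Machine(local_latency=30, remote_latency=100, local_capacity_frac=0.25)
COMA = Machine(local_latency=30, remote_latency=150, local_capacity_frac=0.90)

# Coarse-grained partitions: CC-NUMA can place data at its user's node.
CC_NUMA_COARSE = replace(CC_NUMA, local_capacity_frac=0.90)

if __name__ == "__main__":
    for share in (0.9, 0.1):
        print(f"capacity share {share}: "
              f"CC-NUMA {avg_miss_latency(CC_NUMA, share):.2f}, "
              f"COMA {avg_miss_latency(COMA, share):.2f}, "
              f"CC-NUMA (coarse) {avg_miss_latency(CC_NUMA_COARSE, share):.2f}")
```

Under these assumed numbers, COMA has the lower average miss latency only when capacity misses dominate and data is finely interleaved; when coherence misses dominate, or when partitions are coarse enough for CC-NUMA to keep data local, CC-NUMA comes out ahead.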

Published in: Proceedings of the 19th Annual International Symposium on Computer Architecture, 1992.