Cache-Only Memory Architectures

The shared-memory abstraction makes parallel programs easier to write, but tuning an application to reduce the impact of frequent long-latency memory accesses still requires substantial programmer effort. Researchers have proposed using compilers, operating systems, or architectures to improve performance by allocating data close to the processors that use it. The Cache-Only Memory Architecture (COMA) increases the likelihood that data is available locally because the hardware transparently replicates data and migrates it to the memory module of the node currently accessing it. Each memory module acts as a large cache in which each block carries a tag holding its address and state. The authors explain the functionality, architecture, performance, and complexity of COMA systems. They also outline different COMA designs, compare COMA with traditional nonuniform memory access (NUMA) systems, and describe proposed improvements to NUMA systems that target the same performance obstacles as COMA.
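The key mechanism, a memory module that is looked up by tag match rather than by a fixed home address, can be sketched as follows. This is a minimal illustrative model, not the paper's design: the class and method names are invented, the states are a simplified subset of a real coherence protocol, and the eviction logic omits COMA's hard problem of relocating the last copy of a block instead of dropping it.

```python
# Hypothetical sketch of a COMA "attraction memory": each node's memory
# module behaves like a large set-associative cache, so a block is found
# by tag comparison, and blocks migrate to the node that accesses them.
INVALID, SHARED, EXCLUSIVE = "I", "S", "E"  # simplified state set

class AttractionMemory:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # each set holds up to `ways` entries of (tag, state, data)
        self.sets = [[] for _ in range(num_sets)]

    def lookup(self, block_addr):
        """Return block data on a local hit, or None on a miss
        (the coherence protocol would then fetch from a remote node)."""
        tag, index = divmod(block_addr, self.num_sets)
        for entry in self.sets[index]:
            if entry["tag"] == tag and entry["state"] != INVALID:
                return entry["data"]
        return None

    def migrate_in(self, block_addr, data, state=SHARED):
        """Install a block fetched from a remote node.
        A real COMA must relocate an evicted block's last copy rather
        than discard it; that complexity is omitted here."""
        tag, index = divmod(block_addr, self.num_sets)
        entries = self.sets[index]
        if len(entries) >= self.ways:
            entries.pop(0)  # simplified eviction policy
        entries.append({"tag": tag, "state": state, "data": data})

am = AttractionMemory(num_sets=4, ways=2)
assert am.lookup(0x12) is None   # first touch misses locally
am.migrate_in(0x12, data=42)     # hardware migrates the block here
assert am.lookup(0x12) == 42     # later accesses hit in local memory
```

The usage lines show the behavior the abstract describes: the first access to a block misses locally, after which the block migrates into the local module and subsequent accesses hit without remote traffic.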
