Active Memory Clusters: Efficient Multiprocessing on Next-Generation Servers

We show how key insights from our research into active memory systems, coupled with emerging trends in commodity network technology, are leading toward the realization of hardware distributed shared memory (DSM) on clusters of industry-standard workstations. We call the result of this convergence active memory clusters. After discussing the current state of the art in hardware DSM, clusters, and software DSM architectures, we highlight the key differences between hardware and software DSM systems and show how these differences are rapidly disappearing in commodity systems, with the notable exception of the specialized memory controller present in hardware DSM systems. We then discuss our recent research results in active memory systems, which show that our active memory controller design increases single-node performance. These results argue for the inclusion of active memory support in forthcoming commodity workstations. We show that active memory support can be treated as an extension of the cache coherence protocol, and that an active memory controller also contains the functionality needed to build a hardware DSM machine. Coupled with enhancements in network technology and a small amount of software support, active memory clusters can achieve hardware DSM performance on next-generation commodity servers.
