Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient cache coherence protocol that incorporates a number of optimizations, including support for multiple communication granularities and use of relaxed memory models. This system is fully functional and runs on a cluster of Alpha workstations.

The primary focus of this paper is to describe the techniques used in Shasta to reduce the checking overhead for supporting fine granularity sharing in software. These techniques include careful layout of the shared address space, scheduling the checking code for efficient execution on modern processors, using a simple method that checks loads using only the value loaded, reducing the extra cache misses caused by the checking code, and combining the checks for multiple loads and stores. To characterize the effect of these techniques, we present detailed performance results for the SPLASH-2 applications running on an Alpha processor. Without our optimizations, the checking overheads are excessively high, exceeding 100% for several applications. However, our techniques are effective in reducing these overheads to a range of 5% to 35% for almost all of the applications. We also describe our coherence protocol and present some preliminary results on the parallel performance of several applications running on our workstation cluster. Our experience so far indicates that once the cost of checking memory accesses is reduced using our techniques, the Shasta approach is an attractive software solution for supporting a shared address space with fine-grain access to data.
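
The checks described above can be illustrated with a small simulation. This is a minimal sketch, not Shasta's inserted code: the names `Node`, `FLAG`, and `LINE_WORDS`, the three-state model, and the `home` list standing in for remote memory are all assumptions made for illustration. It assumes stores consult a per-line state table, and that a load can be checked using only the loaded value if invalid lines are filled with a reserved flag value; the state table is then consulted only when a load returns that flag.

```python
# Illustrative simulation only; every name and constant here is hypothetical.
LINE_WORDS = 16                      # assumed coherence line size, in words
FLAG = -253                          # assumed reserved value stored in invalid lines
INVALID, SHARED, EXCLUSIVE = range(3)


class Node:
    """One processor's local view of the shared address space."""

    def __init__(self, num_lines, home):
        self.home = home                              # stand-in for remote memory
        self.mem = [FLAG] * (num_lines * LINE_WORDS)  # invalid lines hold FLAG
        self.state = [INVALID] * num_lines            # per-line state table
        self.misses = 0                               # times the miss handler ran

    def _miss(self, line, new_state):
        # Stand-in for the protocol: fetch the line if absent, then upgrade.
        lo = line * LINE_WORDS
        if self.state[line] == INVALID:
            self.mem[lo:lo + LINE_WORDS] = self.home[lo:lo + LINE_WORDS]
        self.state[line] = new_state
        self.misses += 1

    def load(self, addr):
        value = self.mem[addr]
        if value == FLAG:                             # fast check: value only
            line = addr // LINE_WORDS
            if self.state[line] == INVALID:           # real miss, not real data
                self._miss(line, SHARED)
                value = self.mem[addr]
        return value

    def store(self, addr, value):
        line = addr // LINE_WORDS
        if self.state[line] != EXCLUSIVE:             # state-table check
            self._miss(line, EXCLUSIVE)
        self.mem[addr] = value
```

In this sketch the common case for a load is a single comparison against `FLAG`. A valid word that happens to equal `FLAG` takes the slow path, finds the line valid in the state table, and returns without invoking the miss handler.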
