Mechanisms and interfaces for software-extended coherent shared memory

Software-extended systems use a combination of hardware and software to implement shared memory on large-scale multiprocessors. Hardware mechanisms accelerate common-case accesses, while software handles exceptional events. To provide fast memory access, this design strategy requires appropriate hardware mechanisms, including caches, location-independent addressing, limited directories, processor access to the network, and a memory-system interrupt. Software-extended systems benefit from the flexibility of software, but only if they have a well-designed interface between their hardware and software components.

This dissertation proposes, designs, tests, measures, and models the novel software-extended memory system of Alewife, a large-scale multiprocessor architecture. A working Alewife machine validates the design, and detailed simulations of the architecture (with up to 256 processors) show the cost versus performance trade-offs involved in building distributed shared memory. The architecture with a five-pointer LimitLESS directory achieves between 71% and 100% of full-map directory performance at a constant cost per processing element.

A worker-set model uses a description of application behavior and architectural mechanisms to predict the performance of software-extended systems. The model shows that software-extended systems exhibit little sensitivity to trap latency and memory-system code efficiency, as long as they implement at least one directory pointer in hardware. Low-cost, software-only directories with no hardware pointers are very sensitive to trap latency and code efficiency, even in systems that implement special optimizations for intranode accesses.

Alewife’s flexible coherence interface facilitates the development of memory-system software and enables a smart memory system, which uses information about applications’ dynamic use of shared memory to optimize performance, with and without help from programmers. An automatic optimization technique transmits information about memory usage from the runtime system to the compiler. The compiler uses this information to optimize accesses to widely shared, read-only data and improves one benchmark’s performance by 22%. Other smart memory features include human-readable profiles of shared-memory accesses and protocols that adapt dynamically to memory reference patterns.
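The division of labor between hardware and software in a limited directory can be illustrated with a small sketch. This is a simplified model written for illustration, not Alewife's actual protocol or interface: the class `DirectoryEntry`, its fields, and the trap counter are all hypothetical names, and the only detail taken from the abstract is the five-pointer configuration. The sketch assumes that reads by the first five sharers are recorded in hardware pointers, that any further sharer causes a trap to software, and that a write to a block with software-held sharers also traps.

```python
from dataclasses import dataclass, field

HW_POINTERS = 5  # five hardware pointers, as in the abstract's configuration


@dataclass
class DirectoryEntry:
    """Sharers of one memory block (hypothetical model, not Alewife's layout)."""
    hw_pointers: list = field(default_factory=list)  # common case, no trap
    sw_overflow: set = field(default_factory=set)    # kept by the trap handler
    traps: int = 0                                   # software interventions

    def read(self, node: int) -> None:
        """Record `node` as a sharer; trap only when the pointers overflow."""
        if node in self.hw_pointers or node in self.sw_overflow:
            return
        if len(self.hw_pointers) < HW_POINTERS:
            self.hw_pointers.append(node)
        else:
            self.traps += 1
            self.sw_overflow.add(node)

    def write(self, node: int) -> set:
        """Return the nodes to invalidate and make `node` the sole holder."""
        to_invalidate = (set(self.hw_pointers) | self.sw_overflow) - {node}
        if self.sw_overflow:
            self.traps += 1  # software walks the extended sharer set
        self.hw_pointers = [node]
        self.sw_overflow = set()
        return to_invalidate


entry = DirectoryEntry()
for reader in range(8):   # eight readers: five fit in hardware, three overflow
    entry.read(reader)
stale = entry.write(0)    # node 0 writes; all other copies become stale
print(entry.traps, sorted(stale))
```

In this model a block with five or fewer sharers never involves software, which is the common case the hardware is meant to accelerate; only widely shared blocks pay the trap cost.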
