Integrated shared-memory and message-passing communication in the Alewife multiprocessor

To date, MIMD multiprocessors have been divided into two classes based on hardware communication models: those supporting shared memory and those supporting message passing. Breaking with tradition, this thesis argues that multiprocessors should integrate both communication mechanisms in a single hardware framework. Such integrated multiprocessors must address several architectural challenges that arise from integration. These challenges include the User-Level Access problem, the Service-Interleaving problem, and the Protocol Deadlock problem. The first involves which communication models are used for communication and how these models are accessed; the second involves avoiding livelocks and deadlocks introduced by multiple simultaneous streams of communication; and the third involves removing multi-node cycles in communication graphs. This thesis introduces these challenges and develops solutions in the context of Alewife, a large-scale multiprocessor. Solutions involve careful definition of communication semantics and interfaces to permit tradeoffs across the hardware/software boundary. Among other things, we will introduce the User-Direct Messaging model for message passing, the transaction buffer framework for preventing cache-line thrashing, and two-case delivery for avoiding protocol deadlock. The Alewife prototype implements cache-coherent shared memory and user-level message passing in a single-chip Communications and Memory Management Unit (CMMU). The hardware mechanisms of the CMMU are coupled with a thin veneer of runtime software to support a uniform high-level communications interface. The CMMU employs a scalable cache-coherence scheme, functions with a single-channel, bidirectional network, and directly supports up to 512 nodes. This thesis describes the design and implementation of the CMMU, associated processor-level interfaces, and runtime software.
Included in our discussion is an implementation framework called service coupling, which permits efficient scheduling of highly contended resources (such as DRAM). This framework is well suited to integrated architectures. To evaluate the efficacy of the Alewife design, this thesis presents results from an operating 32-node Alewife machine. These results include microbenchmarks, to focus on individual mechanisms, and macrobenchmarks, in the form of applications and kernels from the SPLASH and NAS benchmark suites. The large suite of working programs and resulting performance numbers lead us to one of our primary conclusions, namely that the integration of shared-memory and message-passing communication models is possible at a reasonable cost, and can be done with a level of efficiency that does not compromise either model. We conclude by discussing the extent to which the lessons of Alewife can be applied to future multiprocessors. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
