Synchronization and communication in the T3E multiprocessor

This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include performance measurements for various aspects of communication and synchronization.

The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers). E-registers are used as the source or target for all remote communication. They provide a highly pipelined interface to global memory that allows dozens of requests per processor to be outstanding. Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. The T3E also provides a set of virtual hardware barrier/eureka networks that can be arbitrarily embedded into the 3D torus interconnect.
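The benefit of keeping many requests outstanding can be illustrated with a toy cost model. The sketch below is not the T3E interface: the class `ToyERegisterFile`, its `get` and `load` methods, the one-cycle issue cost, and the ten-cycle round-trip latency are all hypothetical, chosen only to show why issuing several remote reads before waiting on any of them hides latency.

```python
"""Toy model of split-phase remote reads through a small register file.

All names and numbers here are hypothetical illustrations, not the
T3E hardware interface or its measured latencies.
"""


class ToyERegisterFile:
    def __init__(self, global_memory, num_registers=8, latency=10):
        self.memory = global_memory        # stands in for remote memory
        self.latency = latency             # assumed round-trip time in cycles
        self.values = [None] * num_registers
        self.ready_at = [0] * num_registers
        self.clock = 0                     # simulated cycle counter

    def get(self, reg, address):
        """Issue a read into register `reg` and return without waiting."""
        self.clock += 1                    # assumed one cycle to issue
        self.values[reg] = self.memory[address]
        self.ready_at[reg] = self.clock + self.latency

    def load(self, reg):
        """Read register `reg`, stalling until its data has arrived."""
        self.clock = max(self.clock, self.ready_at[reg])
        return self.values[reg]


def blocking_reads(memory, addresses):
    """Wait for each read to complete before issuing the next one."""
    regs = ToyERegisterFile(memory)
    results = []
    for address in addresses:
        regs.get(0, address)
        results.append(regs.load(0))
    return results, regs.clock


def pipelined_reads(memory, addresses):
    """Issue every read first, then collect results from the registers."""
    regs = ToyERegisterFile(memory, num_registers=len(addresses))
    for reg, address in enumerate(addresses):
        regs.get(reg, address)
    results = [regs.load(reg) for reg in range(len(addresses))]
    return results, regs.clock
```

In this model the blocking version pays the full round trip for every read, while the pipelined version pays it roughly once plus one issue cycle per request. The cycle counts are properties of the toy model only and say nothing about measured T3E performance.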
