Design and implementation of a multipurpose cluster system network interface unit

Today, the interface between a high-speed network and a high-performance computation node is the least mature hardware technology in scalable general-purpose cluster computing. Currently, the one-interface-fits-all philosophy prevails. This approach performs poorly in some cases because of the complexity of modern memory hierarchies and the wide range of communication sizes and patterns. Today's message-passing NIUs are also unable to use the best data transfer and coordination mechanisms because they are poorly integrated into the computation node's memory hierarchy. These shortcomings unnecessarily constrain the performance of cluster systems. Our thesis is that a cluster system NIU should support multiple communication interfaces layered on a virtual message queue substrate in order to streamline data movement both within each node and between nodes. The NIU should be tightly integrated into the computation node's memory hierarchy via the cache-coherent snoopy system bus so as to gain access to a rich set of data movement operations. We further propose to achieve the goal of a large set of high-performance communication functions with a hybrid NIU micro-architecture that combines custom hardware building blocks with an off-the-shelf embedded processor. These ideas are tested through the design and implementation of the StarT-Voyager NES, an NIU used to connect a cluster of commercial PowerPC-based SMPs. Our prototype demonstrates that it is feasible to implement a multi-interface NIU at reasonable hardware cost. This is achieved by reusing a set of basic hardware building blocks and adopting a layered architecture that separates protected network sharing from software-visible communication interfaces.
Through different mechanisms, our 35 MHz NIU (140 MHz processor core) can deliver very low latency for very short messages (under 2 µs), very high bandwidth for multi-kilobyte block transfers (167 MBytes/s bi-directional bandwidth), and very low processor overhead for multicast communication (each additional destination after the first incurs 10 processor clocks). We introduce the novel idea of supporting a large number of virtual message queues through a combination of hardware Resident message queues and firmware-emulated Non-resident message queues. By using the Resident queues as firmware-controlled caches, our implementation delivers hardware speed on average while providing graceful degradation in a low-cost implementation. Finally, we also demonstrate that an off-the-shelf embedded processor complements custom hardware in the NIU, with the former providing flexibility and the latter performance. We identify the interface between the embedded processor and custom hardware as a critical design component and propose a command and completion queue interface to improve the performance and reduce the complexity of embedded firmware. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
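
The idea of using a small number of Resident queues as a cache over a larger set of virtual message queues can be illustrated with a toy software model. The sketch below is not the StarT-Voyager implementation: the class name `VirtualQueueTable`, the least-recently-used replacement policy, and the promote-on-miss behavior are all assumptions made for illustration, and the abstract does not say which policy the firmware uses.

```python
from collections import OrderedDict, deque


class VirtualQueueTable:
    """Toy model: a few 'Resident' slots cache a larger set of virtual queues.

    Hypothetical illustration only. LRU replacement and promote-on-miss
    are assumptions, not details taken from the thesis.
    """

    def __init__(self, resident_slots):
        self.resident_slots = resident_slots
        self.resident = OrderedDict()  # queue id -> deque, least recent first
        self.non_resident = {}         # queue id -> deque (emulated path)
        self.fast_hits = 0             # enqueues served by a Resident slot
        self.slow_hits = 0             # enqueues served by the emulated path

    def enqueue(self, qid, msg):
        if qid in self.resident:
            # Fast path: the queue currently occupies a Resident slot.
            self.resident.move_to_end(qid)
            self.resident[qid].append(msg)
            self.fast_hits += 1
            return "resident"
        # Slow path: the queue is emulated outside the Resident slots.
        self.non_resident.setdefault(qid, deque()).append(msg)
        self.slow_hits += 1
        self._promote(qid)
        return "non-resident"

    def _promote(self, qid):
        # Move the queue into a Resident slot, evicting the LRU one if full.
        if self.resident_slots == 0:
            return
        if len(self.resident) >= self.resident_slots:
            victim, contents = self.resident.popitem(last=False)
            self.non_resident[victim] = contents
        self.resident[qid] = self.non_resident.pop(qid)

    def dequeue(self, qid):
        # Messages keep their order regardless of where the queue lives.
        q = self.resident.get(qid)
        if q is None:
            q = self.non_resident.get(qid)
        return q.popleft() if q else None
```

In this model, traffic concentrated on a few queues stays on the fast path, while a working set larger than the number of Resident slots still functions correctly and only pays the slower emulated path more often, which is the graceful degradation the abstract describes.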
