Coherent Network Interfaces for Fine-Grain Communication

Historically, processor accesses to memory-mapped device registers have been marked uncachable to ensure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling).

This paper begins an exploration of network interfaces (NIs) that use coherence, called coherent network interfaces (CNIs), to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.

Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register, derived from cachable control registers [39,40], is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue.

Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125%, respectively, on the memory bus and by 74% and 123%, respectively, on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve performance by 17-53% on the memory bus and 30-88% on the I/O bus.
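As an illustration only, the cachable queue described in the abstract (a contiguous region of cache-block-sized entries managed as a circular queue) can be modeled in software roughly as follows. This is a minimal sketch under stated assumptions, not the paper's design: the names `cachable_queue_t`, `cq_enqueue`, and `cq_dequeue`, the 64-byte block size, and the 8-block capacity are all hypothetical. The sketch is single-threaded and omits everything that makes a real CNI interesting, such as hardware coherence actions, memory ordering, and the optimizations the paper develops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES  64  /* assumed cache-block size (matches the 64-byte message) */
#define QUEUE_BLOCKS 8   /* assumed capacity in blocks; one slot is kept empty */

/* One queue entry occupies exactly one cache block. */
typedef struct {
    uint8_t data[BLOCK_BYTES];
} cq_block_t;

/* Hypothetical software model of a cachable queue: a contiguous array of
 * blocks plus head/tail indices that wrap around (circular queue). */
typedef struct {
    cq_block_t blocks[QUEUE_BLOCKS];
    uint32_t head;  /* index of the next block to dequeue */
    uint32_t tail;  /* index of the next block to fill */
} cachable_queue_t;

static void cq_init(cachable_queue_t *q) {
    assert(q != NULL);
    memset(q, 0, sizeof *q);
}

static bool cq_is_empty(const cachable_queue_t *q) {
    return q->head == q->tail;
}

static bool cq_is_full(const cachable_queue_t *q) {
    return (q->tail + 1) % QUEUE_BLOCKS == q->head;
}

/* Copy a message of at most one block into the tail slot, zero-padding the
 * rest of the block. Returns false if the queue is full or len is too big. */
static bool cq_enqueue(cachable_queue_t *q, const void *msg, size_t len) {
    assert(q != NULL && msg != NULL);
    if (len > BLOCK_BYTES || cq_is_full(q)) {
        return false;
    }
    cq_block_t *slot = &q->blocks[q->tail];
    memset(slot->data, 0, BLOCK_BYTES);
    memcpy(slot->data, msg, len);
    q->tail = (q->tail + 1) % QUEUE_BLOCKS;
    return true;
}

/* Copy the whole head block into out (which must hold BLOCK_BYTES bytes).
 * Returns false if the queue is empty. */
static bool cq_dequeue(cachable_queue_t *q, void *out) {
    assert(q != NULL && out != NULL);
    if (cq_is_empty(q)) {
        return false;
    }
    memcpy(out, q->blocks[q->head].data, BLOCK_BYTES);
    q->head = (q->head + 1) % QUEUE_BLOCKS;
    return true;
}
```

In this model a sender would call `cq_enqueue` once per message and a receiver would poll `cq_is_empty` before calling `cq_dequeue`. Sizing each entry to one cache block is meant to echo the abstract's point that coherence allows whole-block burst transfers; how polling and pointer updates interact with the coherence protocol is left to the paper itself.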

[1]  Donald Yeung,et al.  The MIT Alewife machine: architecture and performance , 1995, ISCA '98.

[2]  Kai Li,et al.  Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1994, ISCA '98.

[3]  S.K. Reinhardt,et al.  Decoupled Hardware Support for Distributed Shared Memory , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[4]  W. Daniel Hillis,et al.  The Network Architecture of the Connection Machine CM-5 , 1996, J. Parallel Distributed Comput.

[5]  Kai Li,et al.  Protected, user-level DMA for the SHRIMP network interface , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[6]  James C. Hoe,et al.  START-NG: Delivering Seamless Parallel Computing , 1995, Euro-Par.

[7]  James R. Larus,et al.  Efficient support for irregular applications on distributed-memory machines , 1995, PPOPP '95.

[8]  Eric A. Brewer,et al.  Remote queues: exposing message queues for optimization and atomicity , 1995, SPAA '95.

[9]  Fredrik Dahlgren.  Boosting the performance of hybrid snooping cache protocols , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[10]  Andrew A. Chien,et al.  A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[11]  Ruby B. Lee,et al.  Tempest: a substrate for portable parallel programs , 1995 .

[12]  C. Thompson.  Special Interest Group , 1995 .

[13]  David A. Wood,et al.  Cost-Effective Parallel Computing , 1995, Computer.

[14]  James R. Larus,et al.  Tempest: a substrate for portable parallel programs , 1995, Digest of Papers. COMPCON'95. Technologies for the Information Superhighway.

[15]  Robert W. Pfile,et al.  Typhoon-Zero Implementation: The Vortex Module , 1995 .

[16]  James R. Larus,et al.  Application-specific protocols for user-level shared memory , 1994, Proceedings of Supercomputing '94.

[17]  James R. Larus,et al.  Where is time spent in message-passing and shared-memory programs? , 1994, ASPLOS VI.

[18]  Anoop Gupta,et al.  Integration of message passing and shared memory in the Stanford FLASH multiprocessor , 1994, ASPLOS VI.

[19]  Anoop Gupta,et al.  The performance impact of flexibility in the Stanford FLASH multiprocessor , 1994, ASPLOS VI.

[20]  Anant Agarwal,et al.  FUGU: Implementing Translation and Protection in a Multiuser, Multimodel Multiprocessor , 1994 .

[21]  A. Gupta,et al.  The Stanford FLASH multiprocessor , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[22]  J. Larus,et al.  Tempest and Typhoon: user-level shared memory , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[23]  Andrew A. Chien,et al.  Integrating networks and memory hierarchies in a multicomputer node architecture , 1994, Proceedings of 8th International Parallel Processing Symposium.

[24]  Anoop Gupta,et al.  The Stanford FLASH multiprocessor , 1994, ISCA '94.

[25]  Shlomo Weiss,et al.  POWER and PowerPC , 1994 .

[26]  David L Weaver,et al.  The SPARC architecture manual : version 9 , 1994 .

[27]  D. Culler,et al.  Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.

[28]  Duncan Roweth.  Computing Surface 2 , 1993, Supercomputer.

[29]  Anant Agarwal,et al.  Anatomy of a message in the Alewife multiprocessor , 1993, ICS '93.

[30]  Dana S. Henry,et al.  A tightly-coupled processor-network interface , 1992, ASPLOS V.

[31]  Anant Agarwal,et al.  Closing the window of vulnerability in multiphase memory transactions , 1992, ASPLOS V.

[32]  W. Daniel Hillis,et al.  The network architecture of the Connection Machine CM-5 (extended abstract) , 1992, SPAA '92.

[33]  Hiroaki Ishihata,et al.  Low-Latency Message Communication Support for the AP1000 , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[34]  Anoop Gupta,et al.  The Stanford Dash multiprocessor , 1992, Computer.

[35]  Seth Copen Goldstein,et al.  Active Messages: A Mechanism for Integrated Communication and Computation , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[36]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[37]  Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors , 1991 .

[38]  Philip M Evans.  The SPARC architecture manual , 1991 .

[39]  Shekhar Y. Borkar,et al.  Supporting systolic and memory communication in iWarp , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[40]  Andrew A. Chien,et al.  The J-Machine: A Fine Grain Concurrent Computer , 1989 .

[41]  J. Goodman,et al.  The Wisconsin Multicube: a new large-scale cache-coherent multiprocessor , 1988, [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings.

[42]  A. Agarwal,et al.  An evaluation of directory schemes for cache coherence , 1988, [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings.

[43]  Alan Jay Smith,et al.  A class of compatible cache consistency protocols and their support by the IEEE futurebus , 1986, ISCA '86.

[44]  Parallelizing appbt for a shared-memory multiprocessor , 1985 .

[45]  M. Karplus,et al.  CHARMM: A program for macromolecular energy, minimization, and dynamics calculations , 1983 .

[46]  P. H. Lindsay Human Information Processing , 1977 .