Implementing Application-Specific Cache-Coherence Protocols in Configurable Hardware

Streamlining communication is key to achieving good performance in shared-memory parallel programs. While full hardware support for cache coherence generally offers the best performance, not all parallel machines provide it. Instead, software layers using Shared Virtual Memory (SVM) can be built to enforce coherence at a higher level. In prior work, researchers have studied application-specific cache-coherence protocols implemented either in SVM systems or as handlers run by programmable protocol processors. Because these protocols are specialized to the needs of a single application, they can be particularly helpful in reducing the long latencies and processing overhead that sometimes degrade performance in SVM systems. This paper studies implementing application-specific protocols in hardware, but not via an instruction-based protocol processor as is typical. Instead, we consider configurable implementations based on Field-Programmable Gate Arrays (FPGAs). This approach can be faster than software-based techniques and less expensive than some hardware-based techniques. We study one application, appbt, in detail, including a VHDL-level design of the configurable protocol hardware. We sketch out approaches for other applications as well. Implementing protocol operations in configurable hardware improves communication performance by roughly 11X for a 32-node system. While overall speedups are a more modest 12%, our method is promising because of its flexibility and because it offers a new way of harnessing configurable hardware at the network interface, where it already exists or could be easily added to current systems.
