MPI-FM: High Performance MPI on Workstation Clusters

Despite the emergence of high speed LANs, the communication performance available to applications on workstation clusters still falls short of that available on MPPs. A new generation of efficient messaging layers is needed to take advantage of the hardware performance and to deliver it to the application level. Communication software is the key element in bridging the communication performance gap separating MPPs and workstation clusters. MPI-FM is a high performance implementation of the Message Passing Interface (MPI) for networks of workstations connected with a Myrinet network, built on top of the Fast Messages (FM) library. Based on FM version 1.1, released in Fall 1995, MPI-FM achieves a minimum one-way latency of 19 μs and a peak bandwidth of 17.3 Mbyte/s with common MPI send and receive function calls. A direct comparison using published performance figures shows that MPI-FM running on SPARCstation 20 workstations connected with a relatively inexpensive Myrinet network outperforms the MPI implementations available on the IBM SP2 and the Cray T3D, both in latency and in bandwidth, for messages up to 2 kbyte in size. We describe the critical performance issues found in building a high level messaging library (MPI) on top of a low level messaging layer (FM), and the design solutions we adopted for them. One such issue was the direct and efficient support of common operations such as adding and removing a header. Another was the exchange of critical information between the layers, such as the location of the destination buffer. Both optimizations are shown to be necessary, and their combination sufficient, to achieve the aforementioned level of performance. The performance contribution of each optimization is examined in some detail.
These results delineate a new design approach for low level communication layers in which a closer integration with the upper layer and an appropriate balance of the communication pipeline stages are the key elements for high performance.
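
The two optimizations named above can be illustrated with a toy model. The sketch below is not the FM or MPI-FM interface: the header layout and the names `send_gather`, `wire`, `receive_staged`, `receive_direct`, and `find_buffer` are all hypothetical, chosen only to contrast a receive path that stages the message and copies it again with one where the upper layer tells the lower layer where the payload should land.

```python
import struct

# Hypothetical header for illustration only: (tag, payload length).
HEADER = struct.Struct("!II")

def send_gather(tag, payload):
    """Hand the header and payload to the lower layer as separate pieces,
    so the upper layer never builds one contiguous header+payload buffer."""
    return [HEADER.pack(tag, len(payload)), memoryview(payload)]

def wire(pieces):
    """Stand-in for the network: the bytes as they arrive at the receiver."""
    return b"".join(pieces)

def receive_staged(packet, user_buffers):
    """Baseline: the lower layer copies the whole message into a staging
    buffer, then the upper layer strips the header and copies again."""
    staging = bytearray(packet)                      # first copy
    tag, length = HEADER.unpack_from(staging)
    start = HEADER.size
    user_buffers[tag][:length] = staging[start:start + length]  # second copy
    return tag

def receive_direct(packet, find_buffer):
    """Integrated path: the lower layer shows the header to the upper layer,
    which returns the destination buffer; the payload is copied once."""
    view = memoryview(packet)
    tag, length = HEADER.unpack_from(view)
    dest = find_buffer(tag, length)                  # cross-layer information
    start = HEADER.size
    dest[:length] = view[start:start + length]       # single copy into place
    return tag

if __name__ == "__main__":
    posted = {7: bytearray(5)}                       # receive buffer posted for tag 7
    packet = wire(send_gather(7, b"hello"))
    receive_direct(packet, lambda tag, n: posted[tag])
    print(bytes(posted[7]))
```

In this model the only difference between the two receive paths is who knows the destination buffer at the moment the payload is moved, which is the kind of information exchange between layers the abstract refers to.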
