Efficient Message Passing Interface (MPI) for Parallel Computing on Clusters of Workstations

Parallel computing on clusters of workstations and personal computers has very high potential, since it leverages existing hardware and software. Parallel programming environments offer the user a convenient way to express parallel computation and communication. Indeed, the Message Passing Interface (MPI) has recently been proposed as an industry standard for writing “portable” message-passing parallel programs. The communication part of MPI consists of the usual point-to-point communication as well as collective communication. However, existing implementations of programming environments for clusters are built on top of a point-to-point communication layer (send and receive) over local area networks (LANs) and, as a result, suffer from poor performance in the collective communication part. In this paper, we present an efficient design and implementation of the collective communication part of MPI that is optimized for clusters of workstations. Our system consists of two main components: the MPI-CCL layer, which provides the collective communication functionality of MPI, and a User-Level Reliable Transport Protocol (URTP), which interfaces with the LAN Data-Link Layer and exploits the fact that the LAN is a broadcast medium. Our system is integrated with the operating system via an efficient kernel extension mechanism that we developed. The kernel extension significantly improves the performance of our implementation because it handles part of the communication overhead without involving user space. We have implemented our system on a collection of IBM RS/6000 workstations connected via a 10-Mbit/s Ethernet LAN. Our performance measurements are taken from typical scientific programs that run in parallel using MPI. The hypothesis behind our design is that the system's performance is bounded by interactions between the kernel and user space rather than by the bandwidth delivered by the LAN Data-Link Layer. Our results indicate that our MPI Broadcast (on top of Ethernet) is about twice as fast as a recently published software implementation of broadcast on top of ATM.
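
To make the collective-communication interface concrete, the sketch below shows the kind of call an application would issue to the MPI-CCL layer: a plain MPI_Bcast in C. The buffer size, variable names, and payload are illustrative assumptions, not taken from the paper; the point is that the application makes one collective call, and the library (here, the proposed MPI-CCL/URTP stack) decides how to map it onto the physical broadcast capability of the Ethernet rather than issuing N-1 point-to-point sends.

    /* Hypothetical usage sketch: a standard MPI broadcast as seen by the
     * application.  On a broadcast medium such as Ethernet, the root's
     * single frame can in principle reach all receivers at once. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        double coeffs[1024];                 /* illustrative payload */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                     /* root fills the buffer */
            for (i = 0; i < 1024; i++)
                coeffs[i] = (double)i;
        }

        /* One collective call; the MPI library, not the application,
         * chooses how to realize it on the underlying network. */
        MPI_Bcast(coeffs, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d received coeffs[1023] = %f\n", rank, coeffs[1023]);

        MPI_Finalize();
        return 0;
    }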
