Implementing MPI with Optimized Algorithms for Metacomputing

Abstract: This paper presents an implementation of the Message Passing Interface called PACX-MPI. The major goal of the library is to support heterogeneous metacomputing for MPI applications by clustering MPPs and PVPs. The key concept of the library is a daemon concept. In this paper we focus on two aspects of the library. First, we show the importance of using optimized algorithms for the global operations in such a metacomputing environment. Second, we discuss whether we introduce a bottleneck by using daemon nodes for the external communication.

Keywords: MPI, Metacomputing, Global Operations

I. Why another MPI Implementation?

In the last couple of years a large number of tools and libraries have been developed to enable the coupling of computational resources which may be distributed all over the world [5], [6], [8], [9], [12], [13]. The goal of such projects is usually to solve problems on a cluster of machines which cannot be solved using a single Massively Parallel Processing system (MPP) or Parallel Vector Processor (PVP). PACX-MPI (PArallel Computer eXtension) [10] is an implementation of MPI which tries to meet the demands of distributed computing. While most vendor-implemented libraries do not support interoperability between different MPI libraries, PACX-MPI makes the MPI calls available across different platforms. The Interoperable MPI approach (IMPI) [11] may solve this problem, but it is not yet a standard, and thus there are not yet any MPI implementations conforming to these specifications. Many of the features described in the IMPI draft document are nevertheless reflected in the PACX-MPI concept, although PACX-MPI is also far from providing the full required functionality.

MPICH [7], as the most widely used MPI distribution, supports a large number of platforms and also supports the coupling of machines. The major disadvantage of this implementation is that one may run into difficulties when coupling, for example, two machines with 512 nodes each, because of the number of open ports. In the worst case one may end up with 511 open ports on each node. The number of open ports is furthermore important for all machines that are protected by some kind of firewall. The fewer ports one has to use for coupling different resources, the easier it is to open and to control those ports.

Some other approaches to achieve interoperability have been made. PVMPI [5] makes MPI applications run on a cluster of machines by using PVM for the communication between the different machines. Unfortunately, the user can only use point-to-point operations and has to add some calls that do not conform to MPI. The subsequent project, MPI_Connect, uses the same ideas but replaces PVM by a library called SNIPE [6] and, in contrast to PVMPI, now supports global operations as well. A similar approach has been taken by PLUS [2]. This library additionally supports communication between different message-passing libraries, such as PARMACS, PVM and MPI. But again the user has to add some calls to his application. Another project called Stampi [12] has recently been presented. This project already uses the MPI-2 process model, but focuses mainly on local area computing.

Drawing on the experience of many of these efforts, this paper presents the concept of PACX-MPI and the results achieved with it, especially with respect to optimized communication algorithms. The paper is organized as follows. The second section describes the main concepts and ideas of PACX-MPI.
In the third section we focus on optimizing global operations for metacomputing. In the fourth section we then discuss whether we introduce a bottleneck by using daemon nodes for the external communication. In the fifth section we present some applications which used PACX-MPI during the Supercomputing'98 event in Orlando; some optimization efforts are described there. In the last section we briefly describe the ongoing work and the future activities in this project.

II. Concept of PACX-MPI

Before we describe the concept of PACX-MPI, we have to define what kind of clusters we want to use it for. With PACX-MPI we do not intend to cluster workstations or even small MPPs or PVPs to simulate a big parallel machine. Our goal is to couple large resources in order to simulate machines which can hardly be built nowadays. This implies that these machines are usually not in the same computing center, and we therefore have to deal with latencies between the machines which are in a completely different range than the latencies inside a single machine.

To couple different MPPs and PVPs, PACX-MPI has to distinguish between internal and external operations. Internal operations are executed using the vendor-implemented MPI library, since these libraries are highly optimized. Furthermore, this is currently the only protocol which is available on every machine and which can exploit the full capabilities of the underlying network of an MPP. PACX-MPI can therefore be described as an implementation of MPI on top of the native MPI libraries. External operations, e.g. point-to-point operations between two nodes on different machines, are handled by a different standard protocol. Currently PACX-MPI supports only TCP/IP, but we will add other protocols such as native ATM within the framework of a European project in the future. In this sense PACX-MPI can be described as a tool that provides multi-protocol MPI for metacomputing. A short sketch of this dispatch between internal and external operations is given at the end of this section.

To avoid each node having to open ports whenever it wants to perform external operations, PACX-MPI uses two specialized daemon nodes for the external communication on each machine. Using these daemon nodes we can minimize the number of open ports and we can use fixed port numbers. These two nodes are transparent to the application and are therefore not part of global communicators such as MPI_COMM_WORLD.

Figure 1 shows a configuration of two machines, each using 4 nodes for the application, and how MPI_COMM_WORLD looks in this example. On the left machine, which shall be machine number one, the first two nodes with local ranks 0 and 1 are not part of MPI_COMM_WORLD, since these are the daemon nodes. The node with local rank 2 is therefore the first node in our global communicator and gets the global rank 0. All other application nodes get a global rank equal to their local rank minus two, so the last node on this machine has the global rank 3. On the second machine, the daemon nodes are again not counted in the global MPI_COMM_WORLD. Its first application node, with local rank 2, becomes number 4 in the global communicator, since the numbering on this machine starts with the last global rank of the previous machine plus one.
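To make the numbering scheme of Figure 1 concrete, the following small C program recomputes the mapping just described. It is only an illustration of the rule (global rank = local rank minus the two daemon nodes, plus the number of application nodes on all previous machines); the function and variable names are our own and are not part of PACX-MPI.

#include <stdio.h>

#define NUM_DAEMONS 2   /* daemon nodes occupy local ranks 0 and 1 on each machine */

/* app_nodes[i] holds the number of application nodes on machine i */
static int global_rank(int machine, int local_rank, const int *app_nodes)
{
    int offset = 0;

    if (local_rank < NUM_DAEMONS)
        return -1;                        /* daemon nodes are not in MPI_COMM_WORLD */

    for (int i = 0; i < machine; i++)     /* application nodes of previous machines come first */
        offset += app_nodes[i];

    return offset + (local_rank - NUM_DAEMONS);
}

int main(void)
{
    const int app_nodes[2] = { 4, 4 };    /* the configuration of Figure 1 */

    printf("%d\n", global_rank(0, 2, app_nodes));   /* first application node, machine 1 -> 0 */
    printf("%d\n", global_rank(0, 5, app_nodes));   /* last application node, machine 1  -> 3 */
    printf("%d\n", global_rank(1, 2, app_nodes));   /* first application node, machine 2 -> 4 */
    return 0;
}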
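The distinction between internal and external operations described above can be sketched in a similarly simplified way. The code below is not the PACX-MPI implementation; it assumes the layout of Figure 1 (two machines with four application nodes each, daemons at local ranks 0 and 1, the outgoing daemon at local rank 0) and ignores the command envelope a real library would have to pass along with the message.

#include <mpi.h>

#define NUM_DAEMONS   2      /* assumed: daemons at local ranks 0 and 1           */
#define APP_PER_HOST  4      /* assumed: 4 application nodes per machine (Fig. 1) */
#define MY_HOST       0      /* assumed: this process runs on the first machine   */
#define OUT_DAEMON    0      /* assumed: local rank of the outgoing daemon        */

static int host_of(int global_rank)  { return global_rank / APP_PER_HOST; }
static int local_of(int global_rank) { return global_rank % APP_PER_HOST + NUM_DAEMONS; }

/* Simplified point-to-point send: dest is a rank in the global communicator,
 * native_comm is the communicator of the vendor MPI library on this machine. */
int pacx_send_sketch(void *buf, int count, MPI_Datatype type,
                     int dest, int tag, MPI_Comm native_comm)
{
    if (host_of(dest) == MY_HOST) {
        /* internal operation: deliver directly through the optimized vendor MPI */
        return MPI_Send(buf, count, type, local_of(dest), tag, native_comm);
    }

    /* external operation: hand the message to the outgoing daemon, which
     * forwards it over TCP/IP to the incoming daemon of the remote machine */
    return MPI_Send(buf, count, type, OUT_DAEMON, tag, native_comm);
}

The essential point is that application nodes never open ports themselves: all external traffic is funneled through the two daemon nodes of each machine.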