Scaling MPI to short-memory MPPs such as BG/L

Scalability to a large number of processes is one of the weaknesses of current MPI implementations. Standard implementations scale to hundreds of nodes, but not beyond. The main problem is that they assume resources (for both data and control-data) will always be available to receive and process unexpected messages. As we will show, this assumption does not always hold, especially on short-memory machines such as BG/L, which has 64K nodes but only 512 MB of memory per node. The objective of this paper is to present an algorithm that improves the robustness of MPI implementations on short-memory MPPs: by taking care of the reception of both data and control-data, the system can scale to any number of nodes. The proposed solution achieves this goal without any observable overhead when memory is not a problem. Furthermore, in the worst case, when memory resources are extremely scarce, the overhead never doubles the execution time (and we should never forget that in this extreme situation, traditional MPI implementations would fail to execute).
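The abstract does not spell out the algorithm itself, but the core idea it describes is a receive path that never assumes memory for an unexpected message is available. The following minimal C sketch illustrates that general idea with a bounded pool of unexpected-message slots and an explicit "retry later" outcome when the pool is full; all names, sizes, and the retry mechanism here are illustrative assumptions, not the paper's actual implementation.

```c
/* Illustrative sketch (not the paper's code): a receive path that never
 * assumes memory for unexpected messages is available. When the bounded
 * pool is exhausted the packet is refused and the sender is asked to
 * retry later, trading some latency for robustness instead of aborting. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define UNEXPECTED_SLOTS 4      /* hypothetical per-node buffer budget */
#define MAX_PAYLOAD      64

typedef struct {
    bool in_use;
    int  source;
    int  tag;
    char payload[MAX_PAYLOAD];
} unexpected_slot_t;

static unexpected_slot_t pool[UNEXPECTED_SLOTS];

typedef enum { RECV_ACCEPTED, RECV_RETRY_LATER } recv_status_t;

/* Called when a message arrives for which no receive has been posted. */
static recv_status_t handle_unexpected(int source, int tag,
                                       const char *data, size_t len)
{
    if (len > MAX_PAYLOAD)
        return RECV_RETRY_LATER;          /* too large to buffer eagerly */

    for (int i = 0; i < UNEXPECTED_SLOTS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = true;
            pool[i].source = source;
            pool[i].tag    = tag;
            memcpy(pool[i].payload, data, len);
            return RECV_ACCEPTED;         /* common case: no extra cost  */
        }
    }
    /* Memory exhausted: refuse the data and let the sender retry, rather
     * than failing as an implementation that assumes space would.       */
    return RECV_RETRY_LATER;
}

int main(void)
{
    /* Fill the pool and show the fifth message is deferred, not lost. */
    for (int msg = 0; msg < UNEXPECTED_SLOTS + 1; msg++) {
        recv_status_t s = handle_unexpected(msg, 99, "payload", 8);
        printf("message %d: %s\n", msg,
               s == RECV_ACCEPTED ? "buffered" : "retry later");
    }
    return 0;
}
```

In the common case every unexpected message finds a free slot, so the extra bookkeeping is negligible; only when memory is genuinely exhausted does the retry path add delay, which matches the paper's claim that overhead appears only under extreme memory pressure.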
