LA-MPI: The Design and Implementation of a Network-Fault-Tolerant MPI for Terascale Clusters

In this paper we discuss the unique architectural elements of the Los Alamos Message Passing Interface (LA-MPI). LA-MPI is a high-performance, network-fault-tolerant, thread-safe MPI library designed for terascale clusters, which are inherently unreliable because of their sheer number of system components and the trade-offs made between cost and performance. We examine in detail the design concepts used to implement LA-MPI. These include reliability features such as application-level checksumming, message retransmission, and automatic message re-routing. We also examine key performance-enhancing features, such as concurrent message routing over multiple, diverse network adapters and protocols, and communication-specific optimizations (e.g., shared memory).

Email: lampi-support@lanl.gov. Los Alamos report LAUR-03-0939. Los Alamos National Laboratory is operated by the University of California for the National Nuclear Security Administration of the United States Department of Energy under contract W-7405-ENG-36. Project support was provided through ASCI/PSE and the Los Alamos Computer Science Institute.
