Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3)

Advances in network technology and computing power have inspired the emergence of high-performance cluster computing systems. While cluster management and hardware high-availability tools are readily available, practical and easily deployable fault-tolerant systems have not been successfully adopted commercially. We present a fault-tolerant system, Multiple fault-tolerant MPI over Myrinet (M^3), that differs in notable respects from other fault-tolerant systems proposed in the literature. M^3 is built on top of Myrinet, which is regarded as one of the best solutions for high-performance networking and is widely used in cluster computing systems because it provides the high-speed switching network essential for interconnecting clusters of workstations or PCs. M^3 is a user-transparent checkpointing system for a multiple-fault-tolerant MPI implementation, based primarily on a coordinated checkpointing protocol. M^3 supports three functionalities critical for fault tolerance: a lightweight failure detection mechanism, dynamic process management that includes process migration, and a consistent checkpoint and recovery mechanism. M^3 requires no modification of application code and preserves most of the high-performance characteristics of Myrinet. This paper describes the architecture of M^3, its design principles, and its implementation issues in detail. We also propose practical solutions for constructing highly available cluster systems for parallel programming. Experimental results support our claim that M^3 is a good candidate for a practically deployable fault-tolerant system in very large, high-performance Myrinet clusters and that its protocol can be applied to a wide variety of parallel communication libraries without difficulty.
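The abstract names a coordinated checkpointing protocol but does not spell out M^3's actual implementation. As a point of reference only, the following is a minimal C/MPI sketch of the general blocking coordinated-checkpointing idea: all ranks quiesce communication, then snapshot together so the set of local checkpoints forms a consistent global state. The helpers flush_in_flight_messages() and write_local_checkpoint() are hypothetical stand-ins for M^3's internal routines, which the abstract does not describe.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical stand-in: drain or log channel state so that no
       message is in transit across the checkpoint cut. Stubbed out here. */
    static void flush_in_flight_messages(void)
    {
    }

    /* Hypothetical stand-in: save the process state to stable storage. */
    static void write_local_checkpoint(int rank)
    {
        printf("rank %d: local checkpoint written\n", rank);
    }

    /* Blocking coordinated checkpoint: every rank calls this at the same
       coordination point, so the local checkpoints together form a
       consistent global state from which all processes can be rolled back. */
    void coordinated_checkpoint(MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        flush_in_flight_messages();    /* no in-transit messages cross the cut */
        MPI_Barrier(comm);             /* all ranks agree on the cut           */
        write_local_checkpoint(rank);  /* each rank saves its own state        */
        MPI_Barrier(comm);             /* commit only after every rank is done */
    }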
