Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3) ∗

∗ © ACM, 2005. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the ACM/IEEE SuperComputing (SC|05) Conference, November 12-18, 2005, Seattle, WA, USA. http://sc05.supercomputing.org/ Copyright 2005 ACM 1-59593-061-2/05/0011 ...$5.00.

Advances in network technology and computing power have inspired the emergence of high-performance cluster computing systems. While cluster management and hardware high-availability tools are readily available, practical and easily deployable fault-tolerant systems have not been successfully adopted commercially. We present a fault-tolerant system, Multiple fault-tolerant MPI over Myrinet (M^3), that differs in notable respects from other fault-tolerant systems proposed in the literature. M^3 is built on top of Myrinet, which is regarded as one of the best solutions for high-performance networks and is widely used in cluster computing systems because it provides the high-speed switching network that is essential for interconnecting clusters of workstations or PCs. M^3 is a user-transparent checkpointing system for a multiple fault-tolerant MPI implementation, based primarily on the coordinated checkpointing protocol. M^3 supports three functionalities that are critical for fault tolerance: a lightweight failure detection mechanism, dynamic process management that includes process migration, and a consistent checkpoint and recovery mechanism. M^3 requires no modifications to application code and preserves much of the high-performance characteristics of Myrinet. This paper describes the architecture of M^3, its detailed design principles, and comprehensive implementation issues. We also propose practical solutions for those involved in constructing highly available cluster systems for parallel programming systems. Experimental results substantiate our assertion that M^3 can be a good candidate for practically deployable fault-tolerant systems in very large, high-performance Myrinet clusters and that its protocol can be applied to a wide variety of parallel communication libraries without difficulty.
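The abstract names coordinated checkpointing as the underlying protocol but does not spell it out. As a rough illustration only, the sketch below simulates a generic, textbook-style blocking coordinated checkpoint in plain Python: every process drains its pending messages, takes a tentative snapshot, and the snapshot is committed only once all processes have acknowledged. The `Process` class, `coordinated_checkpoint`, and `recover` are hypothetical names invented for this example; none of this is the system's actual code, and no MPI or Myrinet API is used.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    """Hypothetical stand-in for one MPI rank (not the paper's data structure)."""
    rank: int
    state: int = 0                              # toy application state: a running sum
    inbox: list = field(default_factory=list)   # delivered but unprocessed messages
    tentative: Optional[dict] = None            # snapshot awaiting global commit
    committed: Optional[dict] = None            # last globally consistent snapshot

    def step(self):
        """Consume every pending message into the application state."""
        while self.inbox:
            self.state += self.inbox.pop(0)

    def take_tentative(self):
        # Drain the channel first so the snapshot contains no in-flight messages.
        self.step()
        self.tentative = {"state": self.state}
        return True  # acknowledgement sent back to the coordinator

    def commit(self):
        self.committed, self.tentative = self.tentative, None

    def rollback(self):
        self.state = self.committed["state"]
        self.inbox.clear()


def coordinated_checkpoint(procs):
    """Phase 1: all processes snapshot. Phase 2: commit only if all acknowledged."""
    acks = [p.take_tentative() for p in procs]
    if all(acks):
        for p in procs:
            p.commit()
        return True
    return False


def recover(procs):
    """After a failure, every process returns to the same committed snapshot."""
    for p in procs:
        p.rollback()


if __name__ == "__main__":
    procs = [Process(rank=r) for r in range(3)]
    procs[1].inbox.append(5)          # a message pending at rank 1
    coordinated_checkpoint(procs)     # global snapshot of all three ranks
    procs[2].inbox.append(7)
    procs[2].step()                   # progress made after the checkpoint
    recover(procs)                    # simulated failure: roll everyone back
    print([p.state for p in procs])
```

Because all ranks roll back to the same committed snapshot, no rank ends up having received a message that its sender has "not yet sent", which is the consistency property a coordinated protocol is meant to provide. A real implementation would also have to handle messages still inside the network and process restart or migration, which this toy model leaves out.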
