MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes

Fault tolerance is essential for distributed systems that require a reliable computing environment. Despite more than two decades of extensive research, practical fault-tolerance systems have not been available, owing to the high overhead and inconvenience of previous systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerance systems and to provide a checkpointing and recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management for MPI, and message queue management. MPICH-GF requires no modification of application source code and affects MPICH communication as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the virtual device level. We have evaluated MPICH-GF with NPB applications on Globus middleware.
