A Lightweight Message Logging Scheme for Fault Tolerant MPI

This paper presents a new lightweight logging scheme for MPI to provide fault tolerance. Checkpointing recovery is the most widely used fault tolerance scheme for the distributed systems. However, all the processes should be rolled back and restarted even for a single process failure to preserve consistency. Message logging can be used so that the other processes can proceed unaffected by the failure. However, logging all the messages tends to be prohivitively expensive. We note that the applications programmed using MPI follow certain rules and not all of the messages need to be logged. Our logging scheme is based on this observation and only the absolutely necessary information is logged or piggybacked. As a result, it is possible to greately reduce the logging overhead using our scheme and the experimental results matched well with the expectation.

[1]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[2]  Adrianos Lachanas,et al.  MPI-FT: Portable Fault Tolerance Scheme for MPI , 2000, Parallel Process. Lett..

[3]  Barton P. Miller,et al.  Optimal tracing and replay for debugging message-passing parallel programs , 1992, Proceedings Supercomputing '92.

[4]  Ian T. Foster,et al.  The Globus project: a status report , 1998, Proceedings Seventh Heterogeneous Computing Workshop (HCW'98).

[5]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[6]  Alan L. Cox,et al.  Lazy release consistency for software distributed shared memory , 1992, ISCA '92.

[7]  Jack J. Dongarra,et al.  FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.

[8]  Heon Young Yeom,et al.  MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes , 2004, IEICE Trans. Inf. Syst..

[9]  Heon Young Yeom,et al.  A causal logging scheme for lazy release consistent distributed shared memory systems , 1998, Proceedings 1998 International Conference on Parallel and Distributed Systems (Cat. No.98TB100250).

[10]  Heon Young Yeom,et al.  An efficient causal logging scheme for recoverable distributed shared memory systems , 2002, Parallel Comput..

[11]  Lorenzo Alvisi,et al.  Nonblocking and orphan-free message logging protocols , 1992, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[12]  Georg Stellner,et al.  CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.

[13]  Richard D. Schlichting,et al.  Fail-stop processors: an approach to designing fault-tolerant computing systems , 1983, TOCS.

[14]  Anthony Skjellum,et al.  MPI/FT/sup TM/: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[15]  Roy Friedman,et al.  Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).