Application-level checkpointing for shared memory programs

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
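The general idea of coordinated, application-level CPR among threads can be illustrated with a small sketch. It is not the protocol or the pre-compiler output described in the paper: it uses Python threads rather than OpenMP, a `threading.Barrier` stands in for an OpenMP barrier, and every name in it (`run`, `save_checkpoint`, `ckpt_every`, `fail_at`, the toy workload) is a hypothetical choice made for illustration. All threads meet at a barrier every few steps, one thread writes the shared state to disk while the others wait, and a later run that finds a checkpoint file resumes from it.

```python
import os
import pickle
import tempfile
import threading


def run(path, num_threads=4, total_steps=10, ckpt_every=3, fail_at=None):
    """Toy computation with coordinated checkpoints.

    Returns the final result, or None if a failure was simulated at step fail_at.
    """
    # Restart path: resume from the last saved state if a checkpoint exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            start, partial = pickle.load(f)
    else:
        start, partial = 0, [0] * num_threads
    done = [start]  # number of steps covered by the next checkpoint

    def save_checkpoint():
        # Barrier action: executed by exactly one thread while all others
        # are still blocked at the barrier, so the saved state is consistent.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump((done[0], list(partial)), f)
        os.replace(tmp, path)  # atomic rename: never leave a torn checkpoint

    barrier = threading.Barrier(num_threads, action=save_checkpoint)

    def worker(tid):
        for step in range(start, total_steps):
            if step == fail_at:
                return  # simulated fault: progress since the last checkpoint is lost
            partial[tid] += step * (tid + 1)  # stand-in for real work
            if (step + 1) % ckpt_every == 0:
                done[0] = step + 1  # every thread writes the same value
                barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    failed = fail_at is not None and start <= fail_at < total_steps
    return None if failed else sum(partial)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "job.ckpt")
        run(ckpt, fail_at=7)  # first attempt stops partway through
        print(run(ckpt))      # second attempt resumes from the checkpoint
```

The sketch checkpoints only at points where every thread has arrived, which sidesteps the harder cases (threads holding locks, or threads that reach checkpoint locations at different times) that a real coordination protocol has to handle.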
