Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications

Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications. This paper presents an extension to CPPC to allow the checkpointing of OpenMP applications. The proposed solution maintains the main characteristics of CPPC: portability and reduced checkpoint file size. The performance of the proposal is evaluated using the OpenMP NAS Parallel Benchmarks, showing that most of the applications present small checkpoint overheads.
