CPPC: a compiler‐assisted tool for portable checkpointing of message‐passing applications

With the evolution of high‐performance computing toward heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the application processes to different machines, checkpointing tools must be able to operate in heterogeneous environments. However, some of the data manipulated by a parallel application are not truly portable. Examples of these include opaque state (e.g. data structures for communications support) or diversity of interfaces for a single feature (e.g. communications, I/O). Directly manipulating the underlying ad hoc representations renders checkpointing tools unable to work on different environments. Portable checkpointers usually work around portability issues at the cost of transparency: the user must provide information such as what data need to be stored, where to store them, or where to checkpoint. CPPC (ComPiler for Portable Checkpointing) is a checkpointing tool designed to feature both portability and transparency. It is made up of a library and a compiler. The CPPC library contains routines for variable level checkpointing, using portable code and protocols. The CPPC compiler helps to achieve transparency by relieving the user from time‐consuming tasks, such as data flow and communications analyses and adding instrumentation code. This paper covers both the operation of the CPPC library and its compiler support. Experimental results using benchmarks and large‐scale real applications are included, demonstrating usability, efficiency, and portability. Copyright © 2009 John Wiley & Sons, Ltd.

[1]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[2]  Nitin H. Vaidya,et al.  Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme , 1997, IEEE Trans. Computers.

[3]  Georg Stellner,et al.  CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.

[4]  Tomás F. Pena,et al.  Dual BEM for crack growth analysis on distributed-memory multiprocessors , 2000 .

[5]  Adrianos Lachanas,et al.  MPI-FT: Portable Fault Tolerance Scheme for MPI , 2000, Parallel Process. Lett..

[6]  Bülent Sankur,et al.  Survey over image thresholding techniques and quantitative performance evaluation , 2004, J. Electronic Imaging.

[7]  B. Ramkumar,et al.  Portable checkpointing for heterogeneous architectures , 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.

[8]  Gabriel Rodríguez,et al.  Scalable Computing: Practice and Experience , 2008 .

[9]  E. N. Elnozahy,et al.  Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.

[10]  L. Alvisi,et al.  A Survey of Rollback-Recovery Protocols , 2002 .

[11]  Harrick M. Vin,et al.  Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[12]  W. Kent Fuchs,et al.  Compiler‐assisted full checkpointing , 1994, Softw. Pract. Exp..

[13]  Rudolf Eigenmann,et al.  Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation , 2003, LCPC.

[14]  Volker Strumpen,et al.  Portable and fault-tolerant software systems , 1998, IEEE Micro.

[15]  Roy Friedman,et al.  Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[16]  Özalp Babaoglu,et al.  On the Optimum Checkpoint Selection Problem , 1984, SIAM J. Comput..

[17]  Gabriel Rodríguez,et al.  Compiler-assisted checkpointing of message-passing applications in heterogeneous environments , 2008 .

[18]  Micah Beck,et al.  Compiler-Assisted Memory Exclusion for Fast Checkpointing , 1995 .

[19]  Keshav Pingali,et al.  Mobile MPI programs in computational grids , 2006, PPoPP '06.

[20]  B. Bouteiller,et al.  MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[21]  Hewijin Christine Jiau,et al.  Process Recovery in Heterogeneous Systems , 2003, IEEE Trans. Computers.

[22]  Gabriel Rodríguez,et al.  Controller/Precompiler for Portable Checkpointing , 2006, IEICE Trans. Inf. Syst..

[23]  Daniel Marques,et al.  C3: A System for Automating Application-Level Checkpointing of MPI Programs , 2003, LCPC.

[24]  Michel Raynal,et al.  Consistency Issues in Distributed Checkpoints , 1999, IEEE Trans. Software Eng..

[25]  Javier D. Bruguera,et al.  High performance air pollution modeling for a power plant environment , 2003, Parallel Comput..

[26]  Nitin H. Vaidya,et al.  Staggered Consistent Checkpointing , 1999, IEEE Trans. Parallel Distributed Syst..

[27]  Song Jiang,et al.  Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).