Compiler-generated staggered checkpointing

To minimize work lost to system failures, large parallel applications checkpoint periodically. Application programmers typically insert these checkpoints by hand, which yields synchronous checkpoints: checkpoints that occur at the same program point in all processes. While this solution is tenable for current systems, it will become problematic for future supercomputers with many tens of thousands of nodes, because contention for both the network and the file system will grow. This paper shows that staggered checkpoints (globally consistent checkpoints in which processes checkpoint at different points in the code) can significantly reduce network and file system contention. We describe a compiler-based approach for inserting staggered checkpoints, and we show, using trace-driven simulation, that staggered checkpointing is 23 times faster than synchronous checkpointing.
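The contention argument can be illustrated with a toy model. The sketch below is not the paper's trace-driven simulator and does not reproduce its 23x figure. It assumes a single shared file system whose bandwidth is split evenly among all processes writing at the same moment, and it ignores the network and global consistency entirely. The function name `checkpoint_stalls` and all parameters are hypothetical.

```python
import math


def checkpoint_stalls(start_times, size, bandwidth):
    """Return how long each process spends writing its checkpoint.

    Toy assumption: one file system of `bandwidth` (MB/s) is shared
    equally by every process that is writing at a given instant.
    `start_times[p]` is when process p begins writing `size` MB.
    """
    n = len(start_times)
    order = sorted(range(n), key=lambda p: start_times[p])
    remaining = {}            # active process -> MB still to write
    finish = [0.0] * n
    now, i = 0.0, 0

    while i < n or remaining:
        if not remaining:
            # File system is idle: jump ahead to the next checkpoint start.
            now = max(now, start_times[order[i]])
        # Admit every process whose checkpoint has started by `now`.
        while i < n and start_times[order[i]] <= now:
            remaining[order[i]] = float(size)
            i += 1

        rate = bandwidth / len(remaining)          # fair share per writer
        to_finish = min(remaining.values()) / rate
        to_arrival = start_times[order[i]] - now if i < n else math.inf
        dt = min(to_finish, to_arrival)

        # Advance all active writers by dt and retire those that are done.
        for p in list(remaining):
            remaining[p] -= rate * dt
            if remaining[p] <= 1e-9:
                finish[p] = now + dt
                del remaining[p]
        now += dt

    return [finish[p] - start_times[p] for p in range(n)]


# Four processes, 100 MB each, 100 MB/s file system (illustrative numbers).
synchronous = checkpoint_stalls([0, 0, 0, 0], size=100, bandwidth=100)
staggered = checkpoint_stalls([0, 1, 2, 3], size=100, bandwidth=100)
```

With these illustrative numbers, every process in the synchronous case spends 4 s writing, while each staggered process spends 1 s. The total amount written is the same. In this model the gain comes only from processes no longer competing for the same bandwidth at the same time.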
