Cooperative checkpointing: a robust approach to large-scale systems reliability

Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing.
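The core mechanism, in which the application requests checkpoints and the system decides at runtime whether to perform or skip each one, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Gatekeeper` class, its fields, and the decision rule (perform a checkpoint only when the expected loss of unsaved work is at least the cost of writing it). The paper's actual policies may differ.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Hypothetical runtime component that grants or skips requested checkpoints."""

    checkpoint_cost: float      # seconds needed to write one checkpoint (assumed known)
    failure_probability: float  # assumed chance of a failure before the next request
    unsaved_work: float = 0.0   # seconds of work done since the last performed checkpoint

    def request_checkpoint(self, work_since_last_request: float) -> bool:
        """Return True if the requested checkpoint is performed, False if skipped."""
        self.unsaved_work += work_since_last_request
        # Illustrative rule: checkpoint only when the expected lost work
        # is at least as large as the overhead of saving state.
        expected_loss = self.failure_probability * self.unsaved_work
        if expected_loss >= self.checkpoint_cost:
            self.unsaved_work = 0.0  # state is saved, so nothing is at risk
            return True
        return False


def run_demo() -> list:
    """Issue four checkpoint requests, each after 200 seconds of work."""
    gatekeeper = Gatekeeper(checkpoint_cost=100.0, failure_probability=0.25)
    return [gatekeeper.request_checkpoint(200.0) for _ in range(4)]


if __name__ == "__main__":
    print(run_demo())
```

A periodic scheme would perform every request regardless of conditions. In this sketch the gatekeeper skips requests while little work is at risk, and changing `failure_probability` at runtime would change which requests are granted, which is the kind of adaptability the abstract attributes to cooperative checkpointing.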
