论文信息 - Cooperative checkpointing: a robust approach to large-scale systems reliability

Cooperative checkpointing: a robust approach to large-scale systems reliability

Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing.

Larry Rudolph | Ramendra K. Sahoo | Adam J. Oliner

[1] David Finkel,et al. Book review: The Art of Computer Systems Performance Analysis by R. Jain (Wiley-Interscience, 1991) , 1990, PERV.

[2] James S. Plank,et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems , 2001, J. Parallel Distributed Comput..

[3] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.

[4] Raj Jain,et al. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.

[5] James S. Plank,et al. Experimental assessment of workstation failures and their impact on checkpointing systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[6] Steven J. Deitz,et al. Compiler support for automatic checkpointing , 2002, Proceedings 16th Annual International Symposium on High Performance Computing Systems and Applications.

[7] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.

[8] Ravishankar K. Iyer,et al. Error/failure analysis using event logs from fault tolerant systems , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[9] Thomas G. Dietterich,et al. Discovering Patterns in Sequences of Events , 1985, Artif. Intell..

[10] Jon Stearley,et al. Bad Words: Finding Faults in Spirit's Syslogs , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[11] Meeta Sharma Gupta,et al. Performance implications of periodic checkpointing on large-scale cluster systems , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[12] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[13] Sung-Eun Choi,et al. Compiler-generated staggered checkpointing , 2004 .

[14] Daniel Marques,et al. Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[15] Willy Zwaenepoel,et al. The performance of consistent checkpointing , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[16] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[17] Anand Sivasubramaniam,et al. Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[18] Ray Jain,et al. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.

[19] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.

[20] David F. Heidel,et al. An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[21] Ramendra K. Sahoo,et al. Evaluating cooperative checkpointing for supercomputing systems , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[22] Larry Rudolph,et al. Probabilistic QoS guarantees for supercomputing systems , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[23] Adam Jamison Oliner. Cooperative checkpointing for supercomputing systems , 2005 .

[24] Proceedings International Parallel and Distributed Processing Symposium , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[25] Larry Rudolph,et al. Cooperative checkpointing theory , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.