Cooperative checkpointing: a robust approach to large-scale systems reliability
[1] David Finkel,et al. Book review: The Art of Computer Systems Performance Analysis by R. Jain (Wiley-Interscience, 1991) , 1990, PERV.
[2] James S. Plank,et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems , 2001, J. Parallel Distributed Comput.
[3] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[4] Raj Jain,et al. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.
[5] James S. Plank,et al. Experimental assessment of workstation failures and their impact on checkpointing systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing.
[6] Steven J. Deitz,et al. Compiler support for automatic checkpointing , 2002, Proceedings 16th Annual International Symposium on High Performance Computing Systems and Applications.
[7] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[8] Ravishankar K. Iyer,et al. Error/failure analysis using event logs from fault tolerant systems , 1991, Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.
[9] Thomas G. Dietterich,et al. Discovering Patterns in Sequences of Events , 1985, Artif. Intell.
[10] Jon Stearley,et al. Bad Words: Finding Faults in Spirit's Syslogs , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[11] Meeta Sharma Gupta,et al. Performance implications of periodic checkpointing on large-scale cluster systems , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[12] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[13] Sung-Eun Choi,et al. Compiler-generated staggered checkpointing , 2004.
[14] Daniel Marques,et al. Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[15] Willy Zwaenepoel,et al. The performance of consistent checkpointing , 1992, Proceedings 11th Symposium on Reliable Distributed Systems.
[16] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[17] Anand Sivasubramaniam,et al. Filtering failure logs for a BlueGene/L prototype , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[18] Raj Jain,et al. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.
[19] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[20] David F. Heidel,et al. An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[21] Ramendra K. Sahoo,et al. Evaluating cooperative checkpointing for supercomputing systems , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[22] Larry Rudolph,et al. Probabilistic QoS guarantees for supercomputing systems , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[23] Adam Jamison Oliner. Cooperative checkpointing for supercomputing systems , 2005.
[24] Proceedings International Parallel and Distributed Processing Symposium , 2003, Proceedings International Parallel and Distributed Processing Symposium.
[25] Larry Rudolph,et al. Cooperative checkpointing theory , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.