Hybrid full/incremental checkpoint/restart for MPI jobs in HPC environments
[1] Andrew Lumsdaine,et al. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[2] Philip S. Yu,et al. Toward Predictive Failure Management for Distributed Stream Processing Systems , 2008, 2008 The 28th International Conference on Distributed Computing Systems.
[3] Mendel Rosenblum,et al. The design and implementation of a log-structured file system , 1991 .
[4] Jason Duell,et al. The design and implementation of Berkeley Lab's linux checkpoint/restart , 2005 .
[5] Laxmikant V. Kale,et al. Proactive Fault Tolerance in Large Systems , 2004 .
[6] Yookun Cho,et al. Space-Efficient Page-Level Incremental Checkpointing , 2006 .
[7] Scott R. Kohn,et al. Large scale parallel structured AMR calculations using the SAMRAI framework , 2001, SC.
[8] Chao Wang,et al. Scalable, fault tolerant membership for MPI tasks on HPC systems , 2006, ICS '06.
[9] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[10] J. Duell. The design and implementation of Berkeley Lab's linux checkpoint/restart , 2005 .
[11] Yookun Cho,et al. Adaptive page-level incremental checkpointing based on expected recovery time , 2006, SAC '06.
[12] Georg Stellner,et al. CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.
[13] Laxmikant V. Kalé,et al. A Fault Tolerance Protocol with Fast Fault Recovery , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[14] A.M. Wissink,et al. Large Scale Parallel Structured AMR Calculations Using the SAMRAI Framework , 2001, ACM/IEEE SC 2001 Conference (SC'01).
[15] A. Lumsdaine,et al. A Checkpoint and Restart Service Specification for Open MPI , 2006 .
[16] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[17] Rajeev Thakur,et al. A Meta-Learning Failure Predictor for Blue Gene/L Systems , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[18] Chao Wang,et al. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[19] Anand Sivasubramaniam,et al. Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[20] Stephen L. Scott,et al. A reliability-aware approach for an optimal checkpoint/restart model in HPC environments , 2007, 2007 IEEE International Conference on Cluster Computing.
[21] Barton P. Miller,et al. Process migration in DEMOS/MP , 1983, SOSP '83.
[22] Willy Zwaenepoel,et al. Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.
[23] Mendel Rosenblum,et al. The design and implementation of a log-structured file system , 1991, SOSP '91.
[24] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[25] Wu-chun Feng,et al. A Power-Aware Run-Time System for High-Performance Computing , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[26] George Bosilca,et al. Analysis of the Component Architecture Overhead in Open MPI , 2005, PVM/MPI.
[27] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..
[28] Ruei-Chuan Chang,et al. Continuous Checkpointing: Joining the Checkpointing with Virtual Memory Paging , 1997, Softw. Pract. Exp..
[29] Stephen L. Scott,et al. Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[30] Christian Engelmann,et al. Proactive process-level live migration in HPC environments , 2008, HiPC 2008.
[31] Thomas Hérault,et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[32] B. Bouteiller,et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[33] John Daly. A Model for Predicting the Optimum Checkpoint Interval for Restart Dumps , 2003, International Conference on Computational Science.
[34] Fred Douglis,et al. Transparent process migration: Design alternatives and the sprite implementation , 1991, Softw. Pract. Exp..
[35] Song Jiang,et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[36] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[37] Andrew Lumsdaine,et al. A Component Architecture for LAM/MPI , 2003, PVM/MPI.
[38] Remzi H. Arpaci-Dusseau,et al. Architectural Requirements and Scalability of the NAS Parallel Benchmarks , 1999, ACM/IEEE SC 1999 Conference (SC'99).
[39] Dejan S. Milojicic,et al. Process migration , 1999, ACM Comput. Surv..