Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers
Song Jiang | Fabrizio Petrini | Roberto Gioiosa | José Carlos Sancho
[1] Willy Zwaenepoel,et al. The performance of consistent checkpointing , 1992, Proceedings 11th Symposium on Reliable Distributed Systems.
[2] A. Goscinski,et al. Exploiting operating system services to efficiently checkpoint parallel applications in GENESIS , 2002, Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002. Proceedings.
[3] Hua Zhong,et al. CRAK: Linux Checkpoint/Restart As a Kernel Module , 2001 .
[4] Jason Nieh,et al. Proceedings of the 5th Symposium on Operating Systems Design and Implementation , 2002 .
[5] Fabrizio Petrini,et al. System-level fault-tolerance in large-scale parallel machines with buffered coscheduling , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[6] Daniel Marques,et al. Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[7] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[8] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[9] Fabrizio Petrini,et al. Designing Parallel Operating Systems via Parallel Programming , 2004, Euro-Par.
[10] Dror G. Feitelson,et al. User-level communication in a system with gang scheduling , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.
[11] Fabrizio Petrini,et al. Architectural support for system software on large-scale clusters , 2004, International Conference on Parallel Processing, 2004. ICPP 2004.
[12] Fabrizio Petrini,et al. Predictive Performance and Scalability Modeling of a Large-Scale Application , 2001, ACM/IEEE SC 2001 Conference (SC'01).
[13] Amnon Barak,et al. The MOSIX multicomputer operating system for high performance cluster computing , 1998, Future Gener. Comput. Syst.
[14] Peter K. Szwed,et al. Application-level checkpointing for shared memory programs , 2004, ASPLOS XI.
[15] Kai Li,et al. Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.
[16] Fabrizio Petrini,et al. On the feasibility of incremental checkpointing for scientific computing , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[17] Barton P. Miller,et al. Process migration in DEMOS/MP , 1983, SOSP '83.
[18] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl.
[19] Jason Duell,et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters , 2006 .
[20] Erik A. Hendriks,et al. BProc: the Beowulf distributed process space , 2002, ICS '02.
[21] Fabrizio Petrini,et al. BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers , 2003, SC.
[22] David F. Heidel,et al. An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[23] Wu-chun Feng,et al. Improved Resource Utilization with Buffered Coscheduling , 2001, Parallel Algorithms Appl.
[24] Scott Pakin,et al. STORM: Lightning-Fast Resource Management , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[25] F. Petrini,et al. BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers , 2003, ACM/IEEE SC 2003 Conference (SC'03).
[26] David H. Bailey,et al. The NAS Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl.