Distributed Diskless Checkpoint for Large Scale Systems
[1] Robert B. Ross,et al. Providing Efficient I/O Redundancy in MPI Environments , 2004, PVM/MPI.
[2] Lihao Xu,et al. Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Network Storage Applications , 2006, Fifth IEEE International Symposium on Network Computing and Applications (NCA'06).
[3] John Bent,et al. PLFS: a checkpoint filesystem for parallel applications , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[4] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[5] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst.
[6] Kai Li,et al. Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.
[7] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl.
[8] Rong Zeng,et al. The Design and Implementation of , 2002 .
[9] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[10] Satoshi Matsuoka. The Road to TSUBAME and Beyond , 2008 .
[11] Ahmed Al-Nazer,et al. On Disk-based and Diskless Checkpointing for Parallel and Distributed Systems: An Empirical Analysis , 2005 .
[12] Eric Roman. A Survey of Checkpoint/Restart Implementations , 2002 .
[13] IEEE Transactions on Parallel and Distributed Systems, Vol. 13 , 2002 .
[14] Zizhong Chen,et al. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[15] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[16] G Bronevetsky,et al. Scalable I/O Systems via Node-Local Storage: Approaching 1 TB/sec File I/O , 2009 .
[17] Dhabaleswar K. Panda,et al. Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[18] Charng-da Lu,et al. Scalable Diskless Checkpointing for Large Parallel Systems , 2005 .
[19] Franck Cappello,et al. Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities , 2009, Int. J. High Perform. Comput. Appl.
[20] Zizhong Chen,et al. A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing , 2008, 2008 11th IEEE High Assurance Systems Engineering Symposium.
[21] Eduardo Pinheiro,et al. DRAM errors in the wild , 2011, Commun. ACM.
[22] Anthony Skjellum,et al. Accelerating Reed-Solomon coding in RAID systems with GPUs , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[23] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[24] Yuan Xie,et al. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[25] Catherine D. Schuman,et al. A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage , 2009, FAST.