A Case Study of Incremental and Background Hybrid In-Memory Checkpointing
Norman P. Jouppi | Yuan Xie | Naveen Muralimanohar | Xiangyu Dong
[1] Hyun-Wook Jin, et al. High performance MPI-2 one-sided communication over InfiniBand, 2004, IEEE International Symposium on Cluster Computing and the Grid, 2004. CCGrid 2004.
[2] Jeffrey F. Naughton, et al. Low-Latency, Concurrent Checkpointing for Parallel Programs, 1994, IEEE Trans. Parallel Distributed Syst.
[3] Shih-Hung Chen, et al. Phase-change random access memory: A scalable technology, 2008, IBM J. Res. Dev.
[4] W. Kent Fuchs, et al. Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems, 1995, IEEE Trans. Parallel Distributed Syst.
[5] Miron Livny, et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System, 1997.
[6] Seetharami R. Seelam, et al. Modeling the Impact of Checkpoints on Next-Generation Systems, 2007, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007).
[7] Yuan Xie, et al. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[8] Brian Randell, et al. System structure for software fault tolerance, 1975, IEEE Transactions on Software Engineering.
[9] Yookun Cho, et al. Space-efficient page-level incremental checkpointing, 2005, SAC '05.
[10] Hua Zhong, et al. CRAK: Linux Checkpoint/Restart As a Kernel Module, 1996.
[11] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.
[12] James S. Plank, et al. Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques, 1996, Proceedings 15th Symposium on Reliable Distributed Systems.
[13] Song Jiang, et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers, 2005, ACM/IEEE SC 2005 Conference (SC'05).
[14] Nitin H. Vaidya, et al. A case for two-level distributed recovery schemes, 1995, SIGMETRICS '95/PERFORMANCE '95.
[15] Kai Li, et al. Libckpt: Transparent Checkpointing under UNIX, 1995, USENIX.
[16] Willy Zwaenepoel, et al. The performance of consistent checkpointing, 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.
[17] Kai Li, et al. Memory Exclusion: Optimizing the Performance of Checkpointing Systems, 1999, Softw. Pract. Exp.
[18] Daniel Marques, et al. C3: A System for Automating Application-Level Checkpointing of MPI Programs, 2003, LCPC.
[19] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart, 2005.