Comparing Checkpoint and Rollback Recovery Schemes in a Cluster System
[1] Erol Gelenbe, et al. Performance of rollback recovery systems under intermittent failures, 1978, CACM.
[2] James S. Plank, et al. Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques, 1996, Proceedings 15th Symposium on Reliable Distributed Systems.
[3] Lars Lundberg, et al. Optimal recovery schemes in fault tolerant distributed computing, 2005, Acta Informatica.
[4] John T. Daly, et al. Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters, 2010, HPDC '10.
[5] Stephen L. Scott, et al. Evaluation of fault-tolerant policies using simulation, 2007, 2007 IEEE International Conference on Cluster Computing.
[6] Nitin H. Vaidya, et al. Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme, 1997, IEEE Trans. Computers.
[7] Bharat K. Bhargava, et al. Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach, 1988, Proceedings [1988] Seventh Symposium on Reliable Distributed Systems.
[8] Stephen L. Scott, et al. An optimal checkpoint/restart model for a large scale high performance computing system, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[9] Erol Gelenbe, et al. On the Optimum Checkpoint Interval, 1979, JACM.
[10] Hiroshi Nakamura, et al. Skewed checkpointing for tolerating multi-node failures, 2004, Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.
[11] W. Kent Fuchs, et al. Optimistic message logging for independent checkpointing in message-passing systems, 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.
[12] Erol Gelenbe, et al. Dependable execution of distributed programs, 1995, Simul. Pract. Theory.
[13] Michael R. Lyu. Software Fault Tolerance, 1995.
[14] Stephen L. Scott, et al. A reliability-aware approach for an optimal checkpoint/restart model in HPC environments, 2007, 2007 IEEE International Conference on Cluster Computing.
[15] Stephen L. Scott, et al. Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off, 2005, 2005 IEEE International Conference on Cluster Computing.
[16] Christine Morin, et al. Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems, 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.
[17] Nitin H. Vaidya, et al. A case for two-level distributed recovery schemes, 1995, SIGMETRICS '95/PERFORMANCE '95.
[18] Tadashi Dohi, et al. Numerical computation algorithms for sequential checkpoint placement, 2009, Perform. Evaluation.
[19] Mark A. Franklin, et al. Checkpointing in Distributed Computing Systems, 1996, J. Parallel Distributed Comput.
[20] James S. Plank, et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, 2001, J. Parallel Distributed Comput.
[21] Stephen L. Scott, et al. Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments, 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[22] Nitin H. Vaidya, et al. A Case for Two-Level Recovery Schemes, 1998, IEEE Trans. Computers.
[23] Jean-Marc Vincent, et al. A Flexible Checkpoint/Restart Model in Distributed Systems, 2009, PPAM.
[24] Wei-Tek Tsai, et al. A low overhead checkpointing and rollback recovery scheme for distributed systems, 1989, Proceedings of the Eighth Symposium on Reliable Distributed Systems.
[25] Satish K. Tripathi, et al. Availability of a distributed computer system with failures, 2004, Acta Informatica.