A survey and review of the current state of rollback‐recovery for cluster systems
[1] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst.
[2] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005 .
[3] Andrzej M. Goscinski,et al. Toward Self Discovery for an Autonomic Cluster , 2005, ICA3PP.
[4] Kai Li,et al. Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.
[5] Robert E. Strom,et al. Optimistic recovery in distributed systems , 1985, TOCS.
[6] Leslie Lamport,et al. The Byzantine Generals Problem , 1982, TOPL.
[7] V. Rajaraman,et al. A survey of checkpointing algorithms for parallel and distributed computers , 2000 .
[8] D. Manivannan,et al. Finding Consistent Global Checkpoints in a Distributed Computation , 1997, IEEE Trans. Parallel Distributed Syst.
[9] David B. Johnson,et al. Efficient transparent optimistic rollback recovery for distributed application programs , 1993, Proceedings of 1993 IEEE 12th Symposium on Reliable Distributed Systems.
[10] Augusto Ciuffoletti,et al. A Distributed Domino-Effect free recovery Algorithm , 1984, Symposium on Reliability in Distributed Software and Database Systems.
[11] Willy Zwaenepoel,et al. Manetho: fault tolerance in distributed systems using rollback-recovery and process replication , 1994 .
[12] Ravishankar K. Iyer,et al. Modeling coordinated checkpointing for large-scale supercomputers , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[13] Rudy Lauwereins,et al. User-triggered checkpointing: system-independent and scalable application recovery , 1997, Proceedings Second IEEE Symposium on Computer and Communications.
[14] Jason Duell,et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..
[15] C. R. Landau. The checkpoint mechanism in KeyKOS , 1992, Proceedings of the Second International Workshop on Object Orientation in Operating Systems.
[16] Daniel Marques,et al. Recent advances in checkpoint/recovery systems , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[17] Flaviu Cristian,et al. Understanding fault-tolerant distributed systems , 1991, CACM.
[18] Christine Morin,et al. Checkpointing and recovery of shared memory parallel applications in a cluster , 2003, CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings.
[19] R.E. Strom,et al. A recoverable object store , 1988, Proceedings of the Twenty-First Annual Hawaii International Conference on System Sciences. Volume II: Software track.
[20] Andrzej M. Goscinski,et al. Towards an operating system managing parallelism of computing on clusters , 2000, Future Gener. Comput. Syst..
[21] Darin Anderson. Providing Fault Tolerance in Distributed Systems , 2000 .
[22] Jack Dongarra,et al. PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing , 1995 .
[23] Andrzej M. Goscinski,et al. Transparent and Autonomic Rollback-Recovery in Cluster Systems , 2008, 2008 14th IEEE International Conference on Parallel and Distributed Systems.
[24] Yi-Min Wang,et al. Why optimistic message logging has not been used in telecommunications systems , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.
[25] Wolfgang Graetsch,et al. Fault tolerance under UNIX , 1989, TOCS.
[26] Willy Zwaenepoel,et al. Fault Tolerance for a Workstation Cluster , 1993 .
[27] Angelos Bilas,et al. Fast and transparent recovery for continuous availability of cluster-based servers , 2006, PPoPP '06.
[28] Andrzej M. Goscinski,et al. The development of an efficient checkpointing facility exploiting operating systems services of the GENESIS cluster operating system , 2004, Future Gener. Comput. Syst..
[29] Jochen Liedtke,et al. On micro-kernel construction , 1995, SOSP.
[30] Andrzej M. Goscinski,et al. Distributed operating systems - the logical design , 1991 .
[31] Sarmistha Neogy,et al. Selective recovery in distributed systems , 2004, 2004 IEEE Region 10 Conference TENCON 2004.
[32] Christine Morin,et al. Towards an efficient single system image cluster operating system , 2002, Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002. Proceedings.
[33] James S. Plank,et al. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance , 1997 .
[34] Andrzej M. Goscinski,et al. GENESIS: an efficient, transparent and easy to use cluster operating system , 2002, Parallel Comput..
[35] Andrzej M. Goscinski,et al. The RHODOS migration facility , 1998, J. Syst. Softw..
[36] Nuno Neves,et al. RENEW: a tool for fast and efficient implementation of checkpoint protocols , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing.
[37] Andrzej M. Goscinski,et al. A Cluster Operating System Supporting Parallel Computing , 2001, Cluster Computing.
[38] Michael Treaster,et al. A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems , 2004, ArXiv.
[39] Vijay K. Garg,et al. How to recover efficiently and asynchronously when optimism fails , 1996, Proceedings of 16th International Conference on Distributed Computing Systems.
[40] Lorenzo Alvisi. Understanding the message logging paradigm for masking process crashes , 1996 .
[41] Andrzej M. Goscinski,et al. A Group Communications Facility for Reliable Computing on Clusters , 2001, ISCA PDCS.
[42] Brian Randell,et al. System structure for software fault tolerance , 1975, IEEE Transactions on Software Engineering.