Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
[1] S. Jafar, et al. Theft-induced checkpointing for reconfigurable dataflow applications, 2005, 2005 IEEE International Conference on Electro Information Technology.
[2] Leslie Lamport, et al. The Byzantine Generals Problem, 1982, TOPL.
[3] Niraj K. Jha, et al. Fault-tolerant computer system design, 1996, IEEE Parallel & Distributed Technology: Systems & Applications.
[4] Denis Caromel, et al. A Hybrid Message Logging-CIC Protocol for Constrained Checkpointability, 2005, Euro-Par.
[5] Franck Cappello, et al. Coordinated checkpoint versus message log for fault tolerant MPI, 2004, 2003 Proceedings IEEE International Conference on Cluster Computing.
[6] L. Alvisi, et al. A Survey of Rollback-Recovery Protocols, 2002.
[7] Lorenzo Alvisi, et al. Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery, 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[8] Miron Livny, et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System, 1997.
[9] Dhiraj K. Pradhan, et al. Fault-tolerant computer system design, 1996.
[10] Robert E. Strom, et al. Optimistic recovery in distributed systems, 1985, TOCS.
[11] Kishor S. Trivedi. Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1984.
[12] B. Bouteiller, et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging, 2003, ACM/IEEE SC 2003 Conference (SC'03).
[13] Matteo Frigo, et al. The implementation of the Cilk-5 multithreaded language, 1998, PLDI.
[14] Andrew S. Grimshaw, et al. Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System, 1996, SRDS.
[15] Jason Maassen, et al. Fault-tolerance, malleability and migration for divide-and-conquer applications on the grid, 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[16] Laxmikant V. Kalé, et al. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[17] Theo Ungerer, et al. Asynchrony in Parallel Computing: From Dataflow to Multithreading, 2001, Scalable Comput. Pract. Exp.
[18] S. Jafar, et al. Certification of large distributed computations with task dependencies in hostile environments, 2005, 2005 IEEE International Conference on Electro Information Technology.
[19] Axel W. Krings, et al. A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing, 2005, Euro-Par.
[20] Philip M. Thambidurai, et al. Interactive consistency with multiple failure modes, 1988, Proceedings [1988] Seventh Symposium on Reliable Distributed Systems.
[21] Laxmikant V. Kalé, et al. A fault tolerant protocol for massively parallel systems, 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[22] Volker Strumpen, et al. Portable and fault-tolerant software systems, 1998, IEEE Micro.
[23] Jeff T. Linderoth, et al. Solving large quadratic assignment problems on computational grids, 2002, Math. Program.
[24] Gerson G. H. Cavalheiro, et al. Athapascan-1: On-line building data flow graph in a parallel language, 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).
[25] Achour Mostéfaoui, et al. A communication-induced checkpointing protocol that ensures rollback-dependency trackability, 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.
[26] Thomas Hérault, et al. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, 2002, ACM/IEEE SC 2002 Conference (SC'02).
[27] Georg Stellner, et al. CoCheck: checkpointing and process migration for MPI, 1996, Proceedings of International Conference on Parallel Processing.
[28] Pradeep K. Khosla, et al. Selecting the Right Data Distribution Scheme for a Survivable Storage System (CMU-CS-01-120), 2001.
[29] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.
[30] Axel W. Krings, et al. A Probabilistic Approach for Task and Result Certification of Large-Scale Distributed Applications in Hostile Environments, 2005, EGC.
[31] Randy H. Katz, et al. A case for redundant arrays of inexpensive disks (RAID), 1988, SIGMOD '88.
[32] Luis F. G. Sarmenta. Sabotage-tolerance mechanisms for volunteer computing systems, 2002, Future Gener. Comput. Syst.
[33] Brian Randell, et al. System structure for software fault tolerance, 1975, IEEE Transactions on Software Engineering.