Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing

Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of these problems are node failures and the need for dynamic configuration over long run times. This paper presents two fault-tolerance mechanisms called theft-induced checkpointing and systematic event logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multi-threaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications that need adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small and the maximum work lost by a crashed process is small and bounded.
