Paper information - A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing

This paper presents a new checkpoint/recovery method for dataflow computations using work-stealing in heterogeneous environments such as those found in grid or cluster computing. With the state of the computation based on a dynamic macro dataflow graph, the mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods, Systematic Event Logging and Theft-Induced Checkpointing, are presented. Both are efficient and extremely flexible under the system-state model, allowing recovery on different platforms and with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. Both methods are shown to have very small overhead, and the trade-off between checkpointing cost and recovery cost can be controlled.
