Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol

Failure-free execution will become rare on future exascale computers, which makes fault tolerance an active field of research. In this paper, we study how decomposing an application into many more parallel tasks than the physical parallelism available affects the rollback step of coordinated fault-tolerant protocols. This over-decomposition gives the runtime a better opportunity to balance the workload after a failure without the need for spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant over-decomposition factors. With over-decomposition, restarting execution on the remaining nodes after failures performs very well compared to the classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42%. We also consider a partial restart protocol that reduces the amount of work lost in case of failure by tracking the task dependencies inside processes. In some cases, and thanks to over-decomposition, this partial restart time can represent only 54% of the global restart time.
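The load-balancing argument can be illustrated with a minimal sketch. It is a toy model and not the protocol or the experiments of the paper: it assumes equal-cost, independent tasks, ignores communication and checkpoint costs, and assumes the survivors share the tasks as evenly as possible. The function name and its parameters are hypothetical.

```python
def makespan_after_failure(nodes, factor, failed=1, total_work=1.0):
    """Toy model: time to finish all tasks on the surviving nodes.

    nodes      -- number of nodes before the failure
    factor     -- over-decomposition factor (tasks per node); 1 = classic
    failed     -- number of nodes lost, with no spare node available
    total_work -- total work, split evenly across all tasks (assumption)
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("at least one node must survive")
    tasks = nodes * factor
    task_cost = total_work / tasks
    # The most loaded survivor runs ceil(tasks / survivors) tasks.
    max_tasks_per_node = -(-tasks // survivors)
    return max_tasks_per_node * task_cost


classic = makespan_after_failure(nodes=8, factor=1)   # one survivor runs 2 tasks
overdec = makespan_after_failure(nodes=8, factor=8)   # 64 tasks over 7 survivors
print(classic, overdec)
```

In this model, losing one of 8 nodes with one task per node forces a survivor to run two full tasks, while 8 tasks per node lets the 7 survivors share 64 smaller tasks almost evenly. The figures it produces are illustrative only and are unrelated to the 42% measured in the experiments.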
