Fault-tolerance for macro dataflow parallel computations on grid

We present a portable fault tolerant mechanism for execution of macro dataflow parallel programs on a large scale distributed and heterogeneous grid including SMP nodes. Our mechanism is based on a portable checkpoint-rollback and supports both parallel programs with dependencies and addition or resilience of heterogeneous resources. We have implemented this mechanism on top of Athapascan programming interface and experimental results are presented.

[1]  Erik Maehle,et al.  Fault-Tolerant Dynamic Task Scheduling Based on Dataflow Graphs , 1998 .

[2]  Michael A. Bender,et al.  Online Scheduling of Parallel Programs on Heterogeneous Systems with Applications to Cilk , 2002, SPAA '00.

[3]  Francine Berman,et al.  Overview of the Book: Grid Computing – Making the Global Infrastructure a Reality , 2003 .

[4]  Miron Livny,et al.  Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .

[5]  Robert D. Blumofe,et al.  Scheduling multithreaded computations by work stealing , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[6]  C. Siva Ram Murthy,et al.  A Fault-Tolerant Dynamic Scheduling Algorithm for Multiprocessor Real-Time Systems and Its Analysis , 1998, IEEE Trans. Parallel Distributed Syst..

[7]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[8]  Gerson G. H. Cavalheiro,et al.  Athapascan-1: On-line building data flow graph in a parallel language , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).

[9]  Harrick M. Vin,et al.  Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[10]  Lorenzo Alvisi,et al.  Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[11]  Peter Steenkiste,et al.  Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery , 1993 .

[12]  Partha Dasgupta,et al.  Distributed Cactus Stacks: Runtime Stack-Sharing Support for Distributed Parallel Programs , 1998 .

[13]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).