A Distributed Scheme for Fault-Tolerance in Large Clusters of Workstations

c © 2006 by John von Neumann Institute for Computing Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.

[1]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[2]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[3]  Franck Cappello,et al.  Coordinated checkpoint versus message log for fault tolerant MPI , 2004, Int. J. High Perform. Comput. Netw..

[4]  Adrianos Lachanas,et al.  MPI-FT: Portable Fault Tolerance Scheme for MPI , 2000, Parallel Process. Lett..

[5]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[6]  Roy Friedman,et al.  Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[7]  George Bosilca,et al.  Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.

[8]  Harrick M. Vin,et al.  Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[9]  Jack J. Dongarra,et al.  FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.

[10]  Georg Stellner,et al.  CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.