A Hybrid Message Logging-CIC Protocol for Constrained Checkpointability

Communication-Induced Checkpointing (CIC) protocols usually assume that any process can be checkpointed at any time. We propose an alternative approach that relaxes the requirement that processes always be checkpointable, without delaying any message reception or altering the message ordering enforced by the communication layer or by the application. This protocol has been implemented in ProActive, an open-source Java middleware for asynchronous and distributed objects that implements the ASP (Asynchronous Sequential Processes) model.
