Modular Checkpointing for Atomicity

Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics to safe re-execution in multithreaded code is not obvious, however. For a thread to correctly re-execute a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If this is not done properly, data inconsistencies and other undesirable behavior may result. Automatically determining what constitutes a consistent global checkpoint is not straightforward, however, because thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction, called stabilizers, that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). Our checkpointing abstraction provides atomicity and isolation guarantees during state restoration, ensuring that restored global states are safe. Our experimental results on several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overhead of using stabilizers is small, and lead us to conclude that stabilizers are a viable mechanism for defining safe checkpoints in concurrent functional programs. We conclude with a case study illustrating how to build open nested transactions from our checkpointing mechanism.
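To make the idea of computing a global checkpoint from monitored communication events more concrete, the following is a minimal Python sketch, not the paper's CML implementation. The names `Monitor`, `stable`, `send`, and `rollback_set` are hypothetical, and the sketch assumes that dependencies flow only from sender to receiver and that events are logged in a single global order. Given a thread that re-executes from its checkpoint, it reports every thread that witnessed an undone effect, together with the log position before which that thread must be restored.

```python
class Monitor:
    """Toy tracker of which threads have witnessed each other's effects.

    Illustrative only: names and design are assumptions, not the paper's API.
    """

    def __init__(self):
        # thread name -> log position at which its checkpoint was taken
        self.checkpoint = {}
        # globally ordered log of communication events: (sender, receiver)
        self.log = []

    def stable(self, thread):
        """Record a checkpoint for `thread` at the current log position."""
        self.checkpoint[thread] = len(self.log)

    def send(self, sender, receiver):
        """Record that `receiver` witnessed an effect produced by `sender`."""
        self.log.append((sender, receiver))

    def rollback_set(self, thread):
        """Map each thread that must revert to the log position it reverts to.

        A thread reverted to position p loses every event at index >= p,
        so each receiver of such an event must revert to just before it.
        """
        revert_to = {thread: self.checkpoint.get(thread, 0)}
        # One pass suffices because the log is in time order: an affected
        # receiver can only have undone sends at later indices.
        for index, (sender, receiver) in enumerate(self.log):
            cut = revert_to.get(sender)
            if cut is not None and index >= cut and receiver not in revert_to:
                revert_to[receiver] = index
        return revert_to


monitor = Monitor()
monitor.stable("A")     # A checkpoints before any communication
monitor.send("A", "B")  # B witnesses A's effect
monitor.stable("C")
monitor.send("B", "C")  # C witnesses B, and so transitively A
monitor.send("D", "A")  # D only sends, so it never witnesses A's effects
print(monitor.rollback_set("A"))
```

In this sketch, re-executing `A` forces `B` and `C` to revert as well, while `D` is left untouched. A real system would restore each affected thread to its latest checkpoint at or before the reported position, rather than to an arbitrary point in the log.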
