Stabilizers: Safe Lightweight Checkpointing for Concurrent Programs

A checkpoint is a mechanism that allows program execution to be restarted from a previously saved state. Checkpoints can be used in conjunction with exception handling abstractions to recover from exceptional or erroneous events, to support debugging or replay mechanisms, or to facilitate algorithms that rely on speculative evaluation. While checkpointing is relatively straightforward in a sequential setting, for example through the capture and application of continuations, it is less clear how to ascribe a meaningful semantics to lightweight and safe checkpoints in the presence of concurrency. For a thread to correctly resume execution from a saved checkpoint, it must ensure that all other threads which have witnessed its unwanted effects after the checkpoint was established are also reverted to a meaningful earlier state. If this is not done, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global state is not straightforward, since thread interactions are a dynamic property of the program; requiring applications to specify such states explicitly is not pragmatic if interactions are complex. In this paper, we present a safe and efficient on-the-fly checkpointing mechanism for concurrent programs. We introduce a new abstraction called stabilizers that permits the specification and restoration of globally consistent checkpoints. The globally consistent state is computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). Our implementation results show that the memory and computation overheads of using stabilizers on highly concurrent server applications are small, averaging roughly 4 to 6%, leading us to conclude that stabilizers are a viable abstraction for defining restorable checkpoint state in complex concurrent programs.
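To make the consistency requirement concrete, the following is a minimal Python sketch of how monitoring communication events can determine which threads must be reverted together. It is an illustration under simplifying assumptions, not the mechanism presented in the paper: the names `Monitor`, `step`, `communicate`, and `rollback_plan` are hypothetical, thread state is reduced to a per-thread event counter, and only dependencies from a sender to its receiver are propagated.

```python
from collections import defaultdict


class Monitor:
    """Toy dependency tracker (hypothetical; not the paper's implementation)."""

    def __init__(self):
        self.clock = defaultdict(int)  # per-thread count of events so far
        self.sends = []                # (sender, send_pos, receiver, recv_pos)

    def step(self, thread):
        """Record one event on `thread` and return its position in that thread's history."""
        self.clock[thread] += 1
        return self.clock[thread]

    def checkpoint(self, thread):
        """In this toy model a checkpoint is just the thread's current position."""
        return self.clock[thread]

    def communicate(self, sender, receiver):
        """Record that `receiver` witnessed an effect produced by `sender`."""
        s = self.step(sender)
        r = self.step(receiver)
        self.sends.append((sender, s, receiver, r))

    def rollback_plan(self, thread, position):
        """Return {thread: position} for every thread that must be reverted."""
        plan = {thread: position}
        changed = True
        while changed:  # iterate until no further thread is pulled in
            changed = False
            for sender, s, receiver, r in self.sends:
                # A send is undone if it occurred after the sender's revert point.
                if sender in plan and s > plan[sender]:
                    target = r - 1  # receiver's state just before the receive
                    if plan.get(receiver, self.clock[receiver]) > target:
                        plan[receiver] = target
                        changed = True
        return plan


m = Monitor()
m.step("A")
cp = m.checkpoint("A")    # A saves a checkpoint at position 1
m.communicate("A", "B")   # B witnesses an effect of A made after the checkpoint
m.communicate("B", "C")   # C witnesses an effect of B, so it depends on A transitively
m.step("D")               # D never interacts with A and is left alone
print(m.rollback_plan("A", cp))  # {'A': 1, 'B': 0, 'C': 0}
```

Reverting A alone would leave B and C holding values derived from effects that no longer exist, which is the kind of inconsistency described above. The fixed-point loop pulls in every thread that directly or transitively witnessed those effects, while an unrelated thread such as D keeps running.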
