Stabilizers: A Safe Lightweight Checkpointing Abstraction for Concurrent Programs

A checkpoint is a mechanism that allows program execution to be restarted from a previously saved state. Checkpoints can be used in conjunction with exception handling abstractions to recover from exceptional or erroneous events, to support debugging or replay mechanisms, or to facilitate algorithms that rely on speculative evaluation. While relatively straightforward to describe in a sequential setting, for example through the capture and application of continuations, it is less clear how to ascribe a meaningful semantics for safe checkpoints in the presence of concurrency. For a thread to correctly resume execution from a saved checkpoint, it must ensure that all other threads which have witnessed its unwanted effects after the establishment of the checkpoint are also reverted to a meaningful earlier state. If this is not done, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global state is not straightforward since thread interactions are a dynamic property of the program; requiring applications to specify such states explicitly is not pragmatic. In this paper, we present a safe and efficient on-the-fly checkpointing mechanism for concurrent programs. We introduce a new linguistic abstraction called stabilizers that permits the specification of per-thread checkpoints and the restoration of globally consistent checkpoints. Global checkpoints are computed through lightweight monitoring of communication events among threads (e.g. message-passing operations or updates to shared variables). Our implementation results show that the memory and computation overheads for using stabilizers average roughly 4 to 6% on our benchmark suite, leading us to conclude that stabilizers are a viable mechanism for defining restorable state in concurrent programs.
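To make the idea of computing a global checkpoint from monitored communication events concrete, the following is a minimal Python sketch. It is not the paper's implementation or API: the `Monitor` class, its method names, and the single logical clock are all hypothetical, and the model is simplified so that only the receiver of a communication counts as having witnessed the sender's effects. Given one thread's chosen checkpoint, it computes the restore point every affected thread would need so that no thread retains an effect that is being undone.

```python
from bisect import bisect_right
from collections import defaultdict


class Monitor:
    """Toy monitor (hypothetical, not the paper's API) that records
    per-thread checkpoints and inter-thread communication events."""

    def __init__(self):
        self.clock = 0
        # thread -> sorted logical times of its checkpoints; 0 = thread start
        self.checkpoints = defaultdict(lambda: [0])
        # sender -> list of (logical time, receiver)
        self.sends = defaultdict(list)

    def _tick(self):
        self.clock += 1
        return self.clock

    def checkpoint(self, thread):
        """Record a per-thread checkpoint and return its logical time."""
        t = self._tick()
        self.checkpoints[thread].append(t)
        return t

    def communicate(self, sender, receiver):
        """Record that `receiver` witnessed an effect of `sender`."""
        self.sends[sender].append((self._tick(), receiver))

    def _latest_checkpoint_before(self, thread, time):
        marks = self.checkpoints[thread]
        return marks[bisect_right(marks, time) - 1]

    def rollback_plan(self, thread, checkpoint_time):
        """Map each affected thread to the time it must be restored to."""
        plan = {thread: checkpoint_time}
        work = [thread]
        while work:
            u = work.pop()
            for time, v in self.sends[u]:
                if time <= plan[u]:
                    continue  # effect predates u's restore point; it is kept
                # v saw an effect that is being undone, so v must go back to
                # its latest checkpoint taken before it witnessed that effect.
                target = self._latest_checkpoint_before(v, time)
                if target < plan.get(v, float("inf")):
                    plan[v] = target
                    work.append(v)  # v's later sends may now be undone too
        return plan


m = Monitor()
c1 = m.checkpoint("C")   # logical time 1
a1 = m.checkpoint("A")   # logical time 2
m.communicate("A", "B")  # logical time 3
b1 = m.checkpoint("B")   # logical time 4, taken after witnessing A
m.communicate("B", "C")  # logical time 5
plan = m.rollback_plan("A", a1)
print(plan)  # {'A': 2, 'B': 0, 'C': 1}
```

In this run, restoring A to its checkpoint forces B back to its start, because B's only checkpoint was taken after it had already witnessed A's effect, and that in turn forces C back to its own checkpoint. A thread that never communicated with A, directly or transitively, would not appear in the plan at all.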
