Optimistic recovery in multi-threaded distributed systems

The problem of recovering distributed systems from crash failures has been widely studied in the context of traditional non-threaded processes. However, extending those solutions to the multi-threaded scenario presents new problems. We identify and address these problems for optimistic logging protocols. There are two natural extensions of optimistic logging protocols to the multi-threaded scenario. The first extension is process-centric, where the points of internal non-determinism caused by threads are logged. The second extension is thread-centric, where each thread is treated as a separate process. The process-centric approach suffers from false causality, while the thread-centric approach suffers from high causality tracking overhead. By observing that the granularity of failures can be different from the granularity of rollbacks, we design a new balanced approach that incurs low causality tracking overhead and also eliminates false causality.
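The trade-off between the two extensions can be illustrated with a small sketch. It is not the protocol from the paper: the scenario, the function `m2_dependencies`, and the unit names (`P`, `Q`, `T1`, `T2`) are hypothetical, and dependencies are tracked as plain sets rather than the structures an actual logging protocol would use. Process P runs two threads that never interact; T1 receives message m1 from process Q, and T2 later sends message m2.

```python
def m2_dependencies(granularity):
    """Return the units that message m2 is recorded as depending on.

    Hypothetical scenario: threads T1 and T2 of process P never interact.
    T1 receives m1 from process Q; afterwards T2 sends m2.
    granularity is "process" (process-centric) or "thread" (thread-centric).
    """
    def unit(process, thread):
        # The tracking unit is the whole process or the individual thread.
        if granularity == "process":
            return process
        return f"{process}.{thread}"

    deps = {}  # tracking unit -> set of units it causally depends on

    # T1 of P receives m1, sent by thread T1 of Q.
    receiver = unit("P", "T1")
    deps.setdefault(receiver, set()).add(unit("Q", "T1"))

    # T2 of P sends m2, which carries the dependencies of its sending unit.
    sender = unit("P", "T2")
    return set(deps.get(sender, set()))


# Process-centric: m2 inherits a dependency on Q that T2 never acquired.
print(m2_dependencies("process"))
# Thread-centric: no false dependency, but every thread is a tracked unit.
print(m2_dependencies("thread"))
```

Under these assumptions, the process-centric run records m2 as depending on Q even though T2 never observed m1, which is the false causality described above. The thread-centric run avoids it, at the cost of tracking one unit per thread instead of one per process.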
