A Programming Language Approach to Fault Tolerance for Fork-Join Parallelism

When running large parallel computations on thousands of processors, the probability that an individual processor will fail during execution cannot be ignored. Either computations must be replicated, or failures must be detected at runtime and the failed subcomputations re-executed. We follow the latter approach and propose a high-level operational semantics that detects computation failures and allows failed computations to be restarted from the point of failure. We implement this high-level semantics with a lower-level operational semantics that gives a more accurate account of processor failures, and we prove in Coq the correspondence between the high-level and low-level semantics.
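As a rough illustration of the detect-and-re-execute idea, the following Python sketch forks two subcomputations and re-runs only the branch whose failure was detected. It is not the paper's semantics: `ProcessorFailure`, `run_with_retry`, `fork_join`, and `flaky` are hypothetical names introduced here, failures are simulated by raising an exception, and a failed branch is re-executed from its beginning rather than restarted from the point of failure.

```python
from concurrent.futures import ThreadPoolExecutor


class ProcessorFailure(Exception):
    """Hypothetical signal that the processor running a task has failed."""


def run_with_retry(task, max_attempts=5):
    """Run task(); on a detected failure, re-execute only this subcomputation."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except ProcessorFailure:
            # Assumption: the failure is detectable and the task is safe to re-run.
            if attempt == max_attempts:
                raise


def fork_join(left, right):
    """Fork two subcomputations and join their results as a pair."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_left = pool.submit(run_with_retry, left)
        f_right = pool.submit(run_with_retry, right)
        return f_left.result(), f_right.result()


def flaky(value, failures):
    """Build a task that fails `failures` times, then returns `value`."""
    remaining = [failures]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ProcessorFailure()
        return value

    return task


if __name__ == "__main__":
    # The left branch fails twice before succeeding; the right branch runs once.
    print(fork_join(flaky(1, 2), flaky(2, 0)))
```

Retrying a branch in isolation is only sound when the subcomputation has no externally visible side effects, which is why the sketch uses pure tasks.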
