Replay, recovery, replication, and snapshots of nondeterministic concurrent programs

Authors' addresses: Haim Gaifman, Institute of Mathematics and Computer Science, Hebrew University, Givat Ram, Jerusalem, 91904, Israel. Michael J. Maher, IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, U.S.A. Ehud Shapiro, Department of Applied Mathematics and Computer Science, The Weizmann Institute of Science, Rehovot, 76100, Israel.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1991 ACM 0-89791-439-2/91/0007/0241 $1.50

The problem of replaying computations of nondeterministic concurrent programs arises in contexts such as debugging and recovery. We investigate the problem for an abstract model of concurrency, which generalizes dataflow networks, processors with shared variables, and logic programming models of concurrency. We say that nondeterminism is visible if the state is determined, up to some (appropriately defined) notion of equivalence, by the external behavior. We show that if nondeterminism is visible then replay is achievable using a one-step lookahead sequential simulation algorithm. If the program has an additional monotonicity property called stability, then recovery is possible without simulating the original computation, by restarting the program from a certain easily constructed state. Also, for stable programs with visible nondeterminism, a process composed of identical parallel processes has the same external behavior as each of its components. Hence high crash-failure resilience is achievable by simple process replication. For such programs there is also an easy solution to the asynchronous snapshot problem.

Stability holds for certain concurrent logic/constraint programming languages. We describe an efficient method for transforming a given stable concurrent logic/constraint program to an equivalent one with visible nondeterminism. The transformation has acceptable [...] thus it could be employed in a d[...] the proposed methods.
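To make the idea of replay under visible nondeterminism concrete, here is a minimal Python sketch on a toy model. It is an illustration only and not the authors' algorithm or abstract model. The example (a nondeterministic merge of two input streams whose output events are tagged with their source) and all names (`enabled_steps`, `run`, `replay`) are assumptions introduced here. Because every enabled step emits a distinct external event, looking one step ahead is enough to reconstruct the state from a recorded trace.

```python
import random


def enabled_steps(state, left, right):
    """Yield (visible_event, next_state) for each step enabled in `state`.

    Hypothetical toy model: `state` is a pair of read positions (i, j)
    into the two input streams. Events carry a source tag, so two
    enabled steps never produce the same visible event.
    """
    i, j = state
    if i < len(left):
        yield ("left", left[i]), (i + 1, j)
    if j < len(right):
        yield ("right", right[j]), (i, j + 1)


def run(left, right, rng):
    """Original computation: pick any enabled step at random, record its event."""
    state, trace = (0, 0), []
    while True:
        steps = list(enabled_steps(state, left, right))
        if not steps:
            return trace, state
        event, state = rng.choice(steps)
        trace.append(event)


def replay(trace, left, right):
    """Sequential replay: look one step ahead and follow the unique
    enabled step whose visible event matches the next recorded event."""
    state = (0, 0)
    for event in trace:
        matches = [nxt for ev, nxt in enabled_steps(state, left, right)
                   if ev == event]
        if len(matches) != 1:
            raise ValueError(f"trace does not match at event {event!r}")
        state = matches[0]
    return state


# Usage: replaying the recorded trace reaches the same state as the run.
left, right = [1, 2, 3], [1, 2]
trace, final_state = run(left, right, random.Random(0))
assert replay(trace, left, right) == final_state
```

If the source tags were dropped from the events, the two streams above could both offer the value `1` at the same time, the match in `replay` would no longer be unique, and the trace would no longer determine the state. That is the situation the visibility condition is meant to rule out.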
