Combining tentative and definite executions for very fast dependable parallel computing

Raghunathan§ P. G. Spirakis¶

* This research was partially supported by the National Science Foundation under grant number CCR88-6949 and by the EEC ESPRIT Basic Research Actions Project ALCOM (No 3075).

† Current address: Ecole des Hautes Etudes en Informatique, Université René Descartes, 45, rue des Saints-Pères, 75006 Paris, France. Permanent address: Department of Computer Science, New York University, 251 Mercer St., New York, NY 10012-1185, USA; +1 (212) 998-3101; kedem@nyu.edu. This author's research was conducted while he was visiting the IBM Research Division at the T. J. Watson Research Center and the Institute for Advanced Computer Studies at the University of Maryland.

‡ IBM Research Division, T. J. Watson Research Center, P. O. Box 704, Yorktown Heights, NY 10598, USA; +1 (914) 784-7846; kpalam@ibm.com.

§ Computer Science Division, University of California, Davis, CA 95616, USA; +1 (916) 752-1287; raghunat@iris.ucdavis.edu. Part of this author's research was conducted while he was visiting New York University.

¶ Computer Technology Institute, Patras University, P. O. Box 1122, 26110 Patras, Greece; +30 (61) 225-073; spirakis@grpatvxl.bitnet. This author's research was conducted while he was visiting New York University.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1991 ACM 089791-397-3/91/0004/0381 $1.50

We present a general and efficient strategy for computing robustly on unreliable parallel machines. The model of a parallel machine that we use is a CRCW PRAM with dynamic resource fluctuations: processors can fail during the computation, and may possibly be restored later. We first introduce the notions of definite and tentative algorithms for executing a single parallel step of an ideal parallel machine on the unreliable machine. A definite algorithm is one that guarantees a correct execution of a step, while a tentative algorithm is one that is "highly likely" to produce a correct execution of a step on the unreliable machine. We show that any definite execution of one step requires Ω(log n) time on an n-processor unreliable machine, even if all the processors function perfectly. This implies an Ω(log n) slowdown for executing any non-trivial program on the unreliable machine, provided only definite executions are used. We get around this overhead by combining tentative and definite execution schemes appropriately, to derive correct and efficient robust executions for arbitrary PRAM programs, with expected amortized slowdown of only O(1) for a variety of reasonable failure models. We achieve this by using a tentative algorithm to execute each of the program's steps, while using a definite algorithm to audit the execution at selected points. If the audit does not certify the execution as correct, then the execution is rolled back to a previous audit point and restarted from there. In contrast to this work, all previous results required a slowdown of Ω(log n), since they used definite algorithms only.
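The audit-and-rollback strategy described above can be illustrated with a small sequential simulation. This is only a minimal sketch under simplifying assumptions, not the paper's PRAM construction: the names `run_with_audits`, `increment`, `interval`, and `fail_prob` are hypothetical, each tentative step is assumed to be lost independently with probability `fail_prob`, and the definite audit is modelled as an always-correct check of per-step completion flags.

```python
import copy
import random


def increment(state):
    """Toy program step (hypothetical): bump a shared counter."""
    state["x"] += 1


def run_with_audits(steps, state, interval, fail_prob, rng):
    """Execute `steps` tentatively, auditing after every `interval` steps.

    Assumption: a tentative step is silently lost with probability
    `fail_prob`; the audit always detects a lost step in its block.
    Returns (final_state, tentative_steps_executed, rollbacks).
    """
    checkpoint = copy.deepcopy(state)  # last state certified by an audit
    pc = 0                             # index of the first uncertified step
    executed = 0
    rollbacks = 0
    while pc < len(steps):
        block = steps[pc:pc + interval]
        done = []
        for step in block:
            executed += 1
            if rng.random() >= fail_prob:  # tentative execution succeeded
                step(state)
                done.append(True)
            else:                          # step lost; not noticed until the audit
                done.append(False)
        if all(done):                      # audit certifies the whole block
            checkpoint = copy.deepcopy(state)
            pc += len(block)
        else:                              # audit fails: roll back to the last audit point
            state = copy.deepcopy(checkpoint)
            rollbacks += 1
    return state, executed, rollbacks
```

For example, `run_with_audits([increment] * 20, {"x": 0}, 5, 0.1, random.Random(1))` always ends with the counter at 20, because any block containing a lost step is discarded and re-executed from the last certified state. In this toy model, if a block passes its audit with constant probability q, the expected number of attempts per block is 1/q, which is the intuition behind a constant expected amortized overhead when audits are spaced far enough apart to absorb their own cost.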
