Gracefully degrading systems using the bulk-synchronous parallel model with randomised shared memory

The bulk-synchronous parallel model (BSPM) was proposed by Valiant (1990) as a bridging model for parallel computation. Combined with randomised shared memory (RSM), the model offers an asymptotically optimal emulation of the PRAM. Using the BSPM with RSM, we show how a gracefully degrading massively parallel system can be obtained through two mechanisms: memory duplication, which ensures global memory integrity and speeds up reconfiguration; and a global reconfiguration method, which restores the logical properties of the system after a fault occurs. We assume fail-stop processors, single faults, no spare processors, and no significant loss of network throughput as a result of faults. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. The overhead of the scheme and the graceful degradation achieved depend on the program being executed. We evaluate the reconfiguration, overhead, and graceful degradation of the system experimentally.
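The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the class `DuplicatedSharedMemory`, the helper `placement`, the SHA-256 based hash, and the centralised rebuild loop are all assumptions made for illustration. Each address is hashed to two distinct live processors (memory duplication), and after a single fail-stop fault the surviving copies are rehashed over the remaining processors (global reconfiguration), so the data ends up spread roughly evenly across them.

```python
import hashlib


def _hash(addr: int, salt: int) -> int:
    """Deterministic pseudo-random hash of an address (illustrative choice)."""
    digest = hashlib.sha256(f"{salt}:{addr}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def placement(addr: int, live: set) -> tuple:
    """Return two distinct live processors that hold the copies of addr."""
    procs = sorted(live)
    n = len(procs)
    first = _hash(addr, 0) % n
    # An offset in 1..n-1 guarantees the second copy is on another processor.
    offset = 1 + _hash(addr, 1) % (n - 1)
    return procs[first], procs[(first + offset) % n]


class DuplicatedSharedMemory:
    """Hashed shared memory with two copies per address.

    Assumes at least 3 processors, so that 2 or more remain after one fault.
    """

    def __init__(self, n_procs: int):
        self.live = set(range(n_procs))
        self.store = {p: {} for p in self.live}  # local memory per processor

    def write(self, addr: int, value) -> None:
        for p in placement(addr, self.live):
            self.store[p][addr] = value

    def read(self, addr: int):
        primary, _ = placement(addr, self.live)
        return self.store[primary][addr]

    def fail(self, dead: int) -> None:
        """Single fail-stop fault: drop the processor, then rehash globally."""
        del self.store[dead]  # its local memory is lost
        self.live.discard(dead)
        # Every address still has one copy, since its two copies were distinct.
        surviving = {}
        for local in self.store.values():
            surviving.update(local)
        # Redistribute over the live processors (simulated centrally here).
        self.store = {p: {} for p in self.live}
        for addr, value in surviving.items():
            self.write(addr, value)
```

In a real system the rebuild would run in parallel on the live processors rather than in one loop; the sketch only shows why duplication makes the rehash possible without any spare processor.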

[1]  James A. Storer,et al.  A new parallel algorithm for the knapsack problem and its implementation on a hypercube , 1990, [1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation.

[2]  Hermann Hellwagner,et al.  Randomized Shared Memory - Concept and Efficiency of a Scalable Shared Memory Scheme , 1993, Parallel Computer Architectures.

[3]  William Yost,et al.  Design of a Router for Fault-Tolerant Networks , 1994, PCRCW.

[4]  K. H. Kim,et al.  Programmer-Transparent Coordination of Recovering Concurrent Processes: Philosophy and Rules for Efficient Implementation , 1988, IEEE Trans. Software Eng..

[5]  Leslie G. Valiant,et al.  Direct Bulk-Synchronous Parallel Algorithms , 1992, J. Parallel Distributed Comput..

[6]  Uzi Vishkin,et al.  Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories , 1984 .

[7]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[8]  Jörg Keller,et al.  Simulation-based Comparison of Hash Functions for Emulated Shared Memory , 1993, PARLE.

[9]  Leslie G. Valiant,et al.  Direct Bulk-Synchronous Parallel Algorithms , 1994, J. Parallel Distributed Comput..

[10]  Bapiraju Vinnakota,et al.  Synthesis of Algorithm-Based Fault-Tolerant Systems from Dependence Graphs , 1993, IEEE Trans. Parallel Distributed Syst..

[11]  Hermann Hellwagner,et al.  On the Practical Efficiency of Randomized Shared Memory , 1992, CONPAR.

[12]  Lionel M. Ni,et al.  Fault-tolerant wormhole routing in meshes , 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[13]  Hermann Hellwagner.  Randomized Shared Memory - Concept and Efficiency of a Scalable Shared Memory Scheme , 1993 .

[14]  R. Jagannathan,et al.  Fault tolerance in parallel implementations of functional languages , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[15]  Ralph Grishman,et al.  The NYU Ultracomputer—Designing an MIMD Shared Memory Parallel Computer , 1983, IEEE Transactions on Computers.

[16]  Toshiyuki Shimizu,et al.  Performance evaluation of the AP1000 , 1993 .

[17]  Richard D. Schlichting,et al.  Fail-stop processors: an approach to designing fault-tolerant computing systems , 1983, TOCS.

[18]  Prithviraj Banerjee,et al.  Design and analysis of software reconfiguration strategies for hypercube multicomputers under multiple faults , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.