Software schemes of reconfiguration and recovery in distributed memory multicomputers using the actor model

Ideally, a multicomputer system should cope with a processor failure by reconstructing itself (and the application running on it) in order to maintain the available computational power of the remaining processors. We discuss how running applications can continue through permanent processor failures. We take advantage of the characteristics of the actor model of parallel computation and dynamically checkpoint the activity of the application. Consequently, the runtime system is able to continue an application through multiple nonconcurrent processor failures. We have implemented our techniques through modifications of the runtime system of the parallel language Charm on an Intel iPSC/s hypercube. After discussing the theory and implementation, we give measurements of the overhead due to fault tolerance for a number of applications and demonstrate continuance of the applications after injection of one or more faults.
