From Recovery Blocks to Concurrent Atomic Actions

This paper reviews the development of error recovery structures that support general fault tolerance, and describes a new object-oriented scheme for error recovery in concurrent systems that generalizes existing schemes based on either conversations or transactions. The new scheme, based on what we term a Coordinated Atomic Action, is intended to make it easier to tolerate hardware faults, software faults, and faults that have affected the environment of the computer system, and to do so for programs that involve cooperating concurrent processes and the use of shared resources.
