Chapter 1: The evolution of the recovery block concept

This chapter reviews the development of the recovery block approach to software fault tolerance and subsequent work based on this approach. It starts with an account of the development and implementations of the basic recovery block scheme in the early 1970s at Newcastle, and then goes on to describe work at Newcastle and elsewhere on extensions to the basic scheme, recovery in concurrent systems, and linguistic support for recovery blocks based on the use of object-oriented programming concepts.


[33]  Santosh K. Shrivastava,et al.  An overview of the Arjuna distributed programming system , 1991, IEEE Software.

[34]  Peter A. Barrett,et al.  Towards an integrated approach to fault tolerance in Delta-4 , 1993, Distributed Syst. Eng..

[35]  David F. McAllister,et al.  Fault-Tolerant SoFtware Reliability Modeling , 1987, IEEE Transactions on Software Engineering.

[36]  Valérie Issarny An exception handling mechanism for parallel object-oriented programming , 1992 .

[37]  Kwang-Hae Kim,et al.  Approaches to implementation of a repairable distributed recovery block scheme , 1988, [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[38]  S. K. Shrivastava,et al.  Sequential pascal with recovery blocks , 1978, Softw. Pract. Exp..

[39]  John C. Knight,et al.  On the provision of backward error recovery in production programming languages , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[40]  Daniel P. Siewiorek,et al.  High-availability computer systems , 1991, Computer.

[41]  Gerald M. Masson,et al.  Certification trails for data structures , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[42]  Peter A. Lee A Reconsideration of the Recovery Block Scheme , 1978, Comput. J..

[43]  Andrea Bondavalli,et al.  A Cost-Effective and Flexible Scheme for Software fault Tolerance , 1993 .

[44]  Kang G. Shin,et al.  Evaluation of Error Recovery Blocks Used for Cooperating Processes , 1984, IEEE Transactions on Software Engineering.

[45]  Wooyoung Kim A Linguistic Framework for Dynamic Composition of Dependability Protocols , 1993 .

[46]  Paul Ammann,et al.  Data Diversity: An Approach to Software Fault Tolerance , 1988, IEEE Trans. Computers.

[47]  H. Hecht,et al.  Fault-Tolerant Software for Real-Time Applications , 1976, CSUR.

[48]  David F. McAllister,et al.  The consensus recovery block , 1983 .

[49]  Samuel Thurston Gregory Programming language facilities for backward error recovery in real-time systems , 1987 .

[50]  Andrea Clematis,et al.  Structuring Conversation in Operation/Procedure Oriented Programming Languages , 1993, Comput. Lang..

[51]  Pattie Maes Concepts and experiments in computational reflection , 1987, OOPSLA 1987.

[52]  John C. Knight,et al.  A Framework for Software Fault Tolerance in Real-Time Systems , 1983, IEEE Transactions on Software Engineering.

[53]  Jean Arlat,et al.  Dependability Modeling and Evaluation of Software Fault-Tolerant Systems , 1990, IEEE Trans. Computers.

[54]  Santosh K. Shrivastava Concurrent Pascal with backward error recovery: language features and examples , 1979 .

[55]  Brian Randell,et al.  Object-Oriented Software Fault Tolerance: Framework, reuse and design diversity , 1993 .

[56]  Hermann Kopetz,et al.  Fault tolerance, principles and practice , 1990 .

[57]  Geppino Pucci On the modelling and testing of recovery block structures , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[58]  Santosh K. Shrivastava,et al.  The duality of fault‐tolerant system structures , 1993, Softw. Pract. Exp..

[59]  Santosh K. Shrivastava,et al.  Reliable Resource Allocation Betvveen Unreliable Processes , 1978, IEEE Transactions on Software Engineering.

[60]  Kishor S. Trivedi,et al.  Modeling Correlation in Software Recovery Blocks , 1993, IEEE Trans. Software Eng..

[61]  Andrea Clematis,et al.  A system architecture for fault tolerance in concurrent software , 1990, Computer.

[62]  K. H. Kim,et al.  Distributed Execution of Recovery Blocks: An Approach for Uniform Treatment of Hardware and Software Faults in Real-Time Applications , 1989, IEEE Trans. Computers.

[63]  Bertrand Meyer,et al.  Eiffel: The Language , 1991 .

[64]  Flaviu Cristian,et al.  Exception Handling and Software Fault Tolerance , 1982, IEEE Transactions on Computers.

[65]  Valrie Issarny Programming Notations for Expressing Error Recovery in a Distributed Object-Oriented Language , 1993 .

[66]  K. H. Kim,et al.  An analysis of the performance impacts of lookahead execution in the conversation scheme , 1988, Proceedings [1988] Seventh Symposium on Reliable Distributed Systems.

[67]  Myron Hecht,et al.  Software reliability in the system context , 1986, IEEE Transactions on Software Engineering.