Fault tolerance in decentralized systems

In a decentralised system the problems of fault tolerance, and in particular of error recovery, vary greatly depending on the design assumptions. For example, in a distributed database system, if one disregards the possibility of undetected invalid inputs or outputs, the errors that have to be recovered from will affect only the database, so backward error recovery will be feasible and should suffice. Such a system typically supports a set of activities that compete for access to a shared database but are otherwise essentially independent of each other. In such circumstances, conventional database transaction processing and distributed protocols enable backward recovery to be provided very effectively. In more general systems, however, the multiple activities will often not simply be competing against each other, but will at times be attempting to co-operate with each other in pursuit of some common goal. Moreover, the activities in decentralised systems typically involve not just computers but also external entities that are not capable of backward error recovery. Such additional complications make the task of error recovery more challenging, and indeed more interesting. This paper provides a brief analysis of the consequences of various such complications, and outlines some recent work on advanced error recovery techniques that they have motivated.
