System Architecture for Fault-Tolerant Processes in Distributed Systems

Abstract: This report focuses on system architectures and protocols for building fault-tolerant distributed systems. It addresses algorithms and protocols in four areas: fault diagnosis, error recovery in replicated systems, error recovery based on self-stabilization, and masking redundancy in replicated systems by means of agreement protocols. The report is a collection of six technical papers that present the results obtained in these areas. The first paper describes a system architecture for building resilient processes using replication and checkpointing, together with the protocols for managing process replication. The second paper presents an agreement protocol that provides the same view of the computation state to each correctly functioning copy of a process. The third paper presents a protocol for self-stabilization in binary trees. This protocol generalizes one of Dijkstra's protocols and, under normal operation, is sufficient to guarantee recovery from any erroneous state. The fourth paper presents a protocol for detecting the termination of a set of cooperating, communicating processes. The last two papers address problems related to fault diagnosis in interconnected systems. The fifth paper surveys the fault-diagnosis algorithms based on the model proposed by Preparata, Metze, and Chien (the PMC model). The sixth paper presents some results toward designing more efficient fault-diagnosis algorithms.

Keywords: Computer architecture; Fault tolerant computing.