Reconfiguration and transient recovery in state machine architectures

We consider an architecture for ultra-dependable operation based on synchronized state machine replication, extended to provide transient recovery and reconfiguration in the presence of arbitrary faults. The architecture allows processors suspected of being faulty to be placed on "probation." Processors on probation cannot disrupt other processors, but those that are nonfaulty or recovering from transient faults are able to remain synchronized with the other processors and with each other, can participate in interactively consistent exchange of data (i.e., Byzantine agreement), and can restore damaged state data by loading majority-voted copies from other processors. The processors that are not on probation are able to coordinate membership of their group and to take processors on and off probation. These properties are achieved even if all the processors on probation exhibit Byzantine faults, provided the number of faulty processors among the others stays within the bound that the underlying algorithms tolerate. Key elements of the architecture are modified treatments for the problems of interactive consistency, clock synchronization, and group membership. Classical algorithms for these problems that tolerate t Byzantine faults among n processors are extended to tolerate t+p faults among n+p processors, partitioned into n "core members" and p "probationers," provided no more than t faults occur among the core members.
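The fault assumption and the majority-voted state restoration described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the processor identifiers, and the choice to count only core members' copies in the vote are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Mapping, Optional, Set


def within_fault_assumption(faulty: Set[str], core: Set[str], t: int) -> bool:
    """True when no more than t of the faulty processors are core members.

    Faulty probationers are not counted, so up to t + p faults among
    n + p processors can satisfy the assumption.
    """
    return len(faulty & core) <= t


def majority_vote(copies: Mapping[str, Hashable], core: Set[str]) -> Optional[Hashable]:
    """Return the value held by a strict majority of core members, else None.

    Assumption for this sketch: copies sent by probationers are ignored,
    which is one simple way to keep them from disrupting the vote.
    """
    votes = Counter(value for pid, value in copies.items() if pid in core)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if 2 * count > len(core) else None


# Hypothetical configuration: n = 4 core members, p = 2 probationers, t = 1.
core = {"c1", "c2", "c3", "c4"}
faulty = {"c4", "p1", "p2"}  # t + p = 3 faults among n + p = 6 processors
copies = {"c1": "s", "c2": "s", "c3": "s", "c4": "bad", "p1": "bad", "p2": "bad"}

restored = majority_vote(copies, core) if within_fault_assumption(faulty, core, 1) else None
```

In this configuration the faulty processors make up half of the system, yet the vote still recovers the state held by the three nonfaulty core members, because only one fault lies within the core. Had all six copies been counted equally, the two values would have tied.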
