Unified approach to synchronous and asynchronous approximate agreement in the presence of hybrid faults

An important problem in fault-tolerant distributed computer systems is maintaining agreement among nonfaulty processes in the presence of undiagnosed faults. Approximate agreement relaxes the requirement that the agreed values be numerically identical: processes need only agree with each other to within a predefined numerical tolerance. Convergent voting algorithms that achieve approximate agreement have been studied in two classes of systems, synchronous and asynchronous, and for both completely connected and partially connected systems. Together, the two properties of synchrony and connectivity yield four voting domains. In all studies to date, each voting domain has been treated as a separate problem. This paper (1) shows that, for at least one broad family of voting algorithms, the four domains are special cases of a more general convergent voting problem; (2) analyzes convergent voting under the 3-mode hybrid fault model of Thambidurai and Park; and (3) presents a set of unifying relations applicable to all four voting domains. These relations are used to specify voting algorithms that optimize fault tolerance, convergence rate, or computational overhead in any given voting domain. The task of designing a voting algorithm for a particular fault-tolerant system is thus greatly simplified.
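To make the idea of convergent voting concrete, the following is a minimal sketch of one voting round in a synchronous, completely connected system. It uses a trimmed mean (discard the `tau` lowest and `tau` highest received values, then average the rest), which is one common textbook form of convergent voting and is not necessarily the family analyzed in this paper. The names `trimmed_mean_vote`, `run_round`, `within_tolerance`, and the parameter `tau` are illustrative assumptions, as is the simple model in which a faulty sender may deliver a different value to each receiver.

```python
from statistics import mean


def trimmed_mean_vote(values, tau):
    """One local vote: drop the tau lowest and tau highest values, average the rest.

    tau is a hypothetical parameter: the number of extreme values discarded
    at each end to mask up to tau arbitrarily faulty senders.
    """
    if tau < 0 or len(values) <= 2 * tau:
        raise ValueError("need more than 2 * tau values to vote")
    ordered = sorted(values)
    kept = ordered[tau:len(ordered) - tau]
    return mean(kept)


def run_round(good_values, faulty_reports, tau):
    """Simulate one synchronous round among the nonfaulty processes.

    good_values[i]    : current value of nonfaulty process i (seen by everyone)
    faulty_reports[i] : values the faulty senders deliver to process i;
                        they may differ from receiver to receiver
    """
    return [
        trimmed_mean_vote(good_values + faulty_reports[i], tau)
        for i in range(len(good_values))
    ]


def within_tolerance(values, epsilon):
    """Approximate agreement check: all values lie within epsilon of each other."""
    return max(values) - min(values) <= epsilon


# Four nonfaulty processes and one faulty sender that reports a wildly
# different value to each receiver.
good = [0.9, 1.0, 1.1, 1.2]
reports = [[100.0], [-100.0], [100.0], [-100.0]]
after_one_round = run_round(good, reports, tau=1)
```

In this sketch each nonfaulty process discards the faulty extreme it received, so the voted values stay inside the range of the original nonfaulty values while their spread shrinks. Repeating `run_round` until `within_tolerance` holds is the sense in which such voting is convergent.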

[1]  Patrick Lincoln,et al.  A Formally Verified Algorithm for Interactive Consistency Under a Hybrid Fault Model, 1993, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, 'Highlights from Twenty-Five Years'.

[2]  Liming Chen,et al.  N-Version Programming: A Fault-Tolerance Approach to Reliability of Software Operation, 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, 'Highlights from Twenty-Five Years'.

[3]  Dhiraj K. Pradhan,et al.  Consensus With Dual Failure Modes , 1991, IEEE Trans. Parallel Distributed Syst..

[4]  Joep L. W. Kessels.  Two Designs of a Fault-Tolerant Clocking System , 1984, IEEE Transactions on Computers.

[5]  Daniel P. Siewiorek,et al.  Fault free performance validation of a fault-tolerant multiprocessor : baseline and synthetic workload measurements , 1985 .

[6]  Neeraj Suri,et al.  Continual On-Line Diagnosis of Hybrid Faults , 1995 .

[7]  Patrick Lincoln,et al.  The Formal Verification of an Algorithm for Interactive Consistency under a Hybrid Fault Model , 1993, CAV.

[8]  Mohammad Hassan Azadmanesh Reaching approximate agreement with multiple fault-modes , 1993 .

[9]  Fred B. Schneider,et al.  Understanding Protocols for Byzantine Clock Synchronization , 1987 .

[10]  Peter N. Marinos,et al.  Synchronization of Fault-Tolerant Clocks in the Presence of Malicious Failures , 1988, IEEE Trans. Computers.

[11]  Brian A. Coan,et al.  A Compiler that Increases the Fault Tolerance of Asynchronous Protocols , 1988, IEEE Trans. Computers.

[12]  Kang G. Shin,et al.  Ensuring Fault Tolerance of Phase-Locked Clocks , 1985, IEEE Transactions on Computers.

[13]  Philip M. Thambidurai,et al.  Interactive consistency with multiple failure modes , 1988, Proceedings [1988] Seventh Symposium on Reliable Distributed Systems.

[14]  Parameswaran Ramanathan,et al.  Clock Synchronization of a Large Multiprocessor System in the Presence of Malicious Faults , 1987, IEEE Transactions on Computers.

[15]  C. L. Liu Elements of Discrete Mathematics , 1985 .

[16]  Chris J. Walter,et al.  The MAFT Architecture for Distributed Fault Tolerance , 1988, IEEE Trans. Computers.

[17]  D. L. Palumbo,et al.  Measurement of SIFT operating system overhead , 1985 .

[18]  P. M. Melliar-Smith,et al.  Synchronizing clocks in the presence of faults , 1985, JACM.