Reaching Fault Diagnosis Agreement under a Hybrid Fault Model

The goal of the fault diagnosis agreement (FDA) problem is for every fault-free processor to detect/locate the same set of faulty processors. The problem is examined for processors under a mixed fault model (also referred to as a hybrid fault model). An evidence-based fault diagnosis protocol is proposed to solve the FDA problem. The proposed protocol first collects, as evidence, the messages accumulated during the Byzantine agreement protocol. By examining the collected evidence, a fault-free processor can detect/locate which processors are faulty. The network can then be reconfigured by removing the detected faulty processors and the links connected to them. The proposed protocol can detect/locate the maximum number of faulty processors when solving the FDA problem.
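The reconfiguration step described above can be illustrated with a minimal sketch. It assumes the network is represented as an adjacency map and that the set of faulty processors has already been agreed on; the `Network` type, the `reconfigure` function, and the four-processor example are hypothetical and are not taken from the proposed protocol.

```python
from typing import Dict, Set

# Assumed representation: processor id -> ids of directly linked processors.
Network = Dict[int, Set[int]]


def reconfigure(network: Network, faulty: Set[int]) -> Network:
    """Return a copy of the network without the faulty processors or their links."""
    return {
        node: {peer for peer in peers if peer not in faulty}
        for node, peers in network.items()
        if node not in faulty
    }


# Hypothetical fully connected network of four processors; processor 3 is
# assumed to be the one that the fault-free processors agreed is faulty.
network = {p: {q for q in range(4) if q != p} for p in range(4)}
reduced = reconfigure(network, {3})
```

Because every fault-free processor holds the same faulty set once agreement is reached, each one would derive the same reduced network when applying this step locally.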
