A Distributed Algorithm for Fault Diagnosis in Systems with Soft Failures

This paper considers system-level diagnosis of soft failures in large, fully distributed networks of processors (or units). The system model assumes that each unit in the network can test (or evaluate) certain other units for the presence of failures. Using this model, and assuming that the total number of faulty units does not exceed a given bound, a distributed algorithm is presented that allows all fault-free units to converge independently to correct and consistent diagnoses of the system status. The algorithm is also shown to apply to bounded-fault situations in which both units and communication links can be faulty.
