A Distributed System-Level Diagnosis Algorithm for Arbitrary Network Topologies

A distributed algorithm is described for detecting and diagnosing faulty processors in an arbitrary network. Fault-free processors perform simple periodic tests on one another; when a fault is detected or a newly repaired processor joins the network, this new information is disseminated in parallel throughout the network. The algorithm is formally proven correct, and it is shown to be optimal in terms of the time required for all of the fault-free processors in the network to learn of a new event. Simulation results are given for arbitrary network topologies.
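
The parallel dissemination step can be illustrated with a minimal sketch. This is not the algorithm from the paper, only a simple flooding model under stated assumptions: communication proceeds in synchronous rounds, faulty processors neither receive nor forward information, and the detecting processor is fault-free. The names `disseminate`, `adjacency`, `faulty`, and `source` are hypothetical and chosen for illustration.

```python
from collections import deque


def disseminate(adjacency, faulty, source):
    """Flood a new event from `source` through the fault-free processors.

    adjacency: dict mapping each processor to an iterable of its neighbors
    faulty:    set of processors assumed not to receive or forward messages
    source:    fault-free processor that first learns of the event

    Returns a dict mapping each reached processor to the round in which
    it learns of the event (the source learns it in round 0).
    """
    if source in faulty:
        raise ValueError("source must be fault-free")
    learned = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            # Skip faulty processors and processors that already know.
            if neighbor in faulty or neighbor in learned:
                continue
            learned[neighbor] = learned[node] + 1
            queue.append(neighbor)
    return learned


# Hypothetical example: a 6-processor ring in which processor 3 has failed
# and processor 2 detects the fault.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
rounds = disseminate(ring, faulty={3}, source=2)
print(rounds)                # {2: 0, 1: 1, 0: 2, 5: 3, 4: 4}
print(max(rounds.values()))  # 4
```

In this model each processor learns of the event after a number of rounds equal to its hop distance from the source in the fault-free part of the network. No scheme that forwards information only between neighboring fault-free processors can do better in the same model, which gives a rough intuition for what time-optimal dissemination means here.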
