Practical application and implementation of distributed system-level diagnosis theory

A distributed self-diagnosing (DSD) project is presented, consisting of the implementation of a distributed self-diagnosis algorithm and its application to distributed computer networks. The EVENT-SELF algorithm combines the rigor of theoretical results with the resource constraints of actual systems. The resource limitations identified in real systems include the available message capacity of the communication network and limited processor execution speed. EVENT-SELF differs from previously published algorithms by adopting an event-driven approach to self-diagnosability: algorithm messages are reduced to those required to indicate changes in system state. Practical issues regarding the CMU-ECE DSD implementation are considered, including the reconfiguration of the testing subnetwork for environments in which processors can be added and removed. One of the goals of this work is to use the resulting CMU-ECE DSD system as an experimental test bed for distributed applications.
