A survey of methods for system-level fault diagnosis

With the increasing need for efficient automatic fault diagnosis in large distributed computing systems, system-level fault diagnosis has been a fertile research area in recent years. There are two types of system-level fault diagnosis methods: classical and adaptive. Classical methods select a set of tests, obtain the results of all of them, and then process the results to identify the faulty units. Adaptive methods first identify a single fault-free unit and then use it to identify all faulty units. Either type of method can assume so-called symmetric or asymmetric test invalidation. Under symmetric invalidation, tests performed by fault-free units always give correct results, while tests performed by faulty units can produce arbitrary results. Under asymmetric invalidation, a faulty unit always fails a test, even if the units that influence the test result are themselves faulty. We survey a number of diagnosis methods of each type under both invalidation assumptions. Each method is considered in the context of a particular diagnostic model (for example, the Boolean n-cube model, in which processors are represented by the nodes of a graph and links by its edges). Finally, a comparison of the two types shows that the classical methods are faster (they require fewer steps for diagnosis) but less efficient (they may misdiagnose more fault-free units as faulty) than the adaptive methods.
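The two invalidation assumptions can be illustrated with a minimal Python sketch. It is not one of the surveyed methods: it simply enumerates every candidate fault set of size at most t and keeps those consistent with the collected test results, which is feasible only for very small systems. The names `consistent`, `diagnose`, and `syndrome`, the encoding of results as 0 (pass) and 1 (fail), and the three-unit ring of tests are all assumptions made for illustration.

```python
from itertools import combinations

def consistent(faulty, syndrome, asymmetric=False):
    """Check a candidate fault set against the test results.

    syndrome maps (tester, tested) -> 0 (pass) or 1 (fail).
    """
    for (tester, tested), result in syndrome.items():
        if tester not in faulty:
            # A fault-free tester always reports the true state.
            if result != (1 if tested in faulty else 0):
                return False
        elif asymmetric and tested in faulty and result != 1:
            # Asymmetric invalidation: a faulty unit always fails a test,
            # even when the tester is faulty too.
            return False
        # Symmetric invalidation: a faulty tester may report anything.
    return True

def diagnose(units, syndrome, t, asymmetric=False):
    """Return every fault set of size <= t consistent with the syndrome."""
    return [
        set(candidate)
        for size in range(t + 1)
        for candidate in combinations(units, size)
        if consistent(set(candidate), syndrome, asymmetric)
    ]

# Hypothetical 3-unit ring: 0 tests 1, 1 tests 2, 2 tests 0; unit 1 is faulty.
syndrome = {(0, 1): 1, (1, 2): 0, (2, 0): 0}
print(diagnose([0, 1, 2], syndrome, t=1))  # prints [{1}]
```

In this example the result is the same under both assumptions, because no test involves a faulty tester and a faulty tested unit at once; the asymmetric rule only adds constraints in that case.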