Fault diagnosis in computing networks
The problem of diagnosing failures in computing systems constructed as interconnections of discrete modules, or units, is considered. The modules in the network are assumed to be capable both of testing other modules and of being tested by them. Two classes of systems are considered: those that incorporate some central, or host, facility, and those that are fully distributed. For the former class, existing system-level fault diagnosis concepts are extended to allow the central facility to diagnose failures from the results of tests performed by the nodes of the system. Specific problems considered include producing a correct diagnosis from test results that do not all reflect the condition of the system at a single point in time, and diagnosing failures that affect the communication facilities over which tests are performed. New diagnostic models and measures are introduced to deal with these problems, and a number of necessary and sufficient conditions are given for a system to achieve a given level of diagnosability under these models.
For the class of fully distributed systems, a new diagnostic process, known as self-diagnosability, is proposed, by which each node in a network can independently diagnose the condition of all other nodes in the system. The need for this type of diagnostic ability is argued by introducing the notion of distributed fault-tolerance. A number of factors affecting the extent and efficiency of the self-diagnostic process are discussed.
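As a rough illustration of diagnosis from test results, the following minimal sketch finds every fault set that is consistent with a collection of test outcomes. It is not the models or procedures of this work. It assumes the classical convention that a fault-free tester reports accurately while a faulty tester's result is unreliable, and a bound `t` on the number of faulty nodes. The names `consistent`, `diagnose`, and `syndrome`, and the three-node ring, are hypothetical and chosen only for the example.

```python
from itertools import combinations

def consistent(faulty, syndrome):
    """Return True if the fault set could have produced the syndrome.

    syndrome maps (tester, tested) -> True when the tester reports a failure.
    Assumption: a fault-free tester is always right; a faulty tester's
    result is arbitrary and therefore constrains nothing.
    """
    for (tester, tested), reports_failure in syndrome.items():
        if tester in faulty:
            continue  # unreliable result, ignore it
        if reports_failure != (tested in faulty):
            return False
    return True

def diagnose(nodes, syndrome, t):
    """Brute force: list every fault set of size <= t consistent with the syndrome."""
    return [set(c)
            for k in range(t + 1)
            for c in combinations(sorted(nodes), k)
            if consistent(set(c), syndrome)]

# Hypothetical three-node ring: 0 tests 1, 1 tests 2, 2 tests 0.
# Node 1 is faulty, so node 0 reports a failure; node 1's own report is arbitrary.
syndrome = {(0, 1): True, (1, 2): False, (2, 0): False}
print(diagnose({0, 1, 2}, syndrome, t=1))  # [{1}]
```

In a centralized setting the host facility would run such a search over the collected results; in a fully distributed setting each node would have to run it on whatever results reach it. The exhaustive search is exponential in `t` and is meant only to make the consistency idea concrete.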