Self-testing software probe system for failure detection and diagnosis

A key problem in today's complex software systems is detecting and isolating software failures. Most software failures are only partial and, if efficiently diagnosed, isolated, and recovered from, need not lead to a total outage. The probe detects failed software components in a running software system by requesting service, or a certain level of service, from a set of functions, modules, and/or subsystems (a target) and checking the response to the request. The objective is to localize the failure only to the level of a target while achieving a high degree of efficiency and confidence in the process. Targets can be identified at different levels or layers in the software; the choice depends on the desired granularity of fault detection, considered together with the level at which recovery is implemented. The probe system is made self-testing against any single failure in its operational components by means of a null probe. It is designed to take advantage of the latency characteristics of errors to provide a low-overhead mechanism. The ideas can be implemented in either a single-computer or a multiple-computer system.
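The request-and-check cycle described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Target`, `null_target`, `probe`, and `run_probes` are hypothetical, and the reading of the null probe as a trivial target that exercises only the probe machinery itself is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Target:
    """A set of functions, modules, or subsystems probed as one unit (hypothetical type)."""
    name: str
    request: Callable[[], object]    # asks the target for service
    check: Callable[[object], bool]  # decides whether the response is acceptable


def null_target() -> Target:
    # Assumption: a null probe touches no real component, so a failure here
    # points at the probe machinery rather than at the probed software.
    return Target("null", lambda: "ok", lambda response: response == "ok")


def probe(target: Target) -> bool:
    """Request service from a target and check the response."""
    try:
        return bool(target.check(target.request()))
    except Exception:
        # An exception while serving or checking counts as a failed target.
        return False


def run_probes(targets: List[Target]) -> Dict[str, object]:
    """Self-test the probe first, then localize failures to the target level."""
    if not probe(null_target()):
        return {"probe_ok": False, "failed": []}
    failed = [t.name for t in targets if not probe(t)]
    return {"probe_ok": True, "failed": failed}


if __name__ == "__main__":
    healthy = Target("parser", lambda: 2 + 2, lambda r: r == 4)
    crashed = Target("scheduler", lambda: 1 / 0, lambda r: True)
    print(run_probes([healthy, crashed]))
```

In this sketch a failure is reported only by target name, which mirrors the stated goal of localizing a failure no further than the target. How often each target is probed, which the abstract ties to error latency, is left out.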
