Measuring lead times for failure prediction

Failure prediction anticipates system failures before they occur so that preemptive action can be taken, thereby improving system dependability. For failure prediction to be effective, the lead time, i.e., the time between the occurrence of a fault and the appearance of a system failure, must be long enough to accommodate both the prediction step and the preemptive action it triggers. Lead time is intrinsically related to complex error propagation phenomena, which depend on the software architecture of the target system (i.e., the system where failures are predicted) and on the runtime dynamics of that software. Lead time is therefore highly dependent on the specific nature and intrinsic details of the target system, so determining its distribution for a particular target system should be the very first step in developing failure prediction models. This step is also of utmost importance because it may determine whether failure prediction is viable for a given target system at all: for example, if the lead time in a given target system is very short, failure prediction is not viable for that system and classic (and expensive) fault tolerance should be applied instead. This paper proposes a method for obtaining the lead time distribution of a system using fault injection and presents a practical experiment that illustrates the method on a virtualized system. The results suggest that the lead times of failures caused by software faults are usually much longer than those of failures caused by hardware faults.
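To make the measurement concrete: for each fault-injection run one records the instant at which the injected fault is activated and the instant at which the resulting failure is observed, and the difference is that run's lead time; the collection of differences across runs yields the empirical lead time distribution. The following is a minimal sketch of that bookkeeping, not the paper's actual tooling; the run records and the reaction-time budget used in the viability check are hypothetical placeholders.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class InjectionRun:
    """One fault-injection run: when the fault was activated and when the failure appeared."""
    fault_activation_s: float          # fault activation time (seconds since run start)
    failure_observed_s: float | None   # observed failure time, or None if no failure surfaced

def lead_times(runs: list[InjectionRun]) -> list[float]:
    """Empirical lead times (fault activation -> failure) over runs that actually failed."""
    return [r.failure_observed_s - r.fault_activation_s
            for r in runs
            if r.failure_observed_s is not None]

# Hypothetical run records; real values would come from fault-injection logs.
runs = [
    InjectionRun(fault_activation_s=12.0, failure_observed_s=340.5),
    InjectionRun(fault_activation_s=8.3,  failure_observed_s=9.1),
    InjectionRun(fault_activation_s=20.0, failure_observed_s=None),  # fault never led to a failure
]

lt = lead_times(runs)
print("median lead time (s):", median(lt))

# Viability check: if most lead times fall below the time needed to predict the
# failure and carry out the preemptive action (hypothetical budget), failure
# prediction is not viable for this target system.
REACTION_BUDGET_S = 30.0
viable_fraction = sum(t >= REACTION_BUDGET_S for t in lt) / len(lt)
print(f"fraction of failures with lead time >= {REACTION_BUDGET_S}s: {viable_fraction:.2f}")
```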