Temporal Fault Injection Considerations in Resilience Quantification

Resilience models assess a system’s ability to withstand disruption by quantifying the value of metrics (e.g., expected cost or loss) over time. When such a metric is the result of injecting faults in a dynamic model over an interval of time, it is important that it represent the statistical expectation of fault responses rather than a single response. Since fault responses vary with fault injection time, representing the statistical expectation of responses requires sampling a number of injection times. However, fault models are often built around computationally expensive dynamic simulations, and it is desirable to iterate over designs as quickly as possible to improve system resilience. With this in mind, this paper explores approaches for sampling fault injection times that minimize computational cost while accurately representing the expectation of fault resilience metrics over the set of possible occurrence times. Two general approaches are presented: an a priori approach that attempts to minimize error without knowing the underlying cost function, and an a posteriori approach that minimizes error when the cost function is known. Among a priori methods, numerical integration yields lower error and computational time than Monte Carlo sampling; however, both are prone to error when the metric’s fault response curve is discontinuous. While a posteriori approaches can locate and correct for these discontinuities, the resulting error reduction is not robust to design changes that shift the location of the discontinuities. The ultimate decision to use an a priori or a posteriori approach to quantify resilience thus depends on a number of considerations, including computational cost, the robustness of the approximation to design changes, and the underlying form of the resilience function.
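
The trade-off described above can be illustrated with a minimal sketch. It is not the paper's implementation: the step-shaped cost curve, the 0 to 10 time horizon, the assumption that fault times are uniformly distributed, and all function names are hypothetical choices made for illustration. The midpoint rule stands in for "numerical integration", and the piecewise estimator stands in for an a posteriori approach that already knows where the discontinuity lies.

```python
import random


def mean_cost_midpoint(cost, t_start, t_end, n_points):
    """A priori estimate: midpoint rule over evenly spaced injection times."""
    width = (t_end - t_start) / n_points
    times = [t_start + (i + 0.5) * width for i in range(n_points)]
    return sum(cost(t) for t in times) / n_points


def mean_cost_monte_carlo(cost, t_start, t_end, n_points, seed=0):
    """A priori estimate: injection times drawn uniformly at random."""
    rng = random.Random(seed)
    times = [rng.uniform(t_start, t_end) for _ in range(n_points)]
    return sum(cost(t) for t in times) / n_points


def mean_cost_piecewise(cost, breakpoints, n_per_segment=1):
    """A posteriori estimate: midpoint rule applied separately between
    known discontinuity locations, weighted by segment length."""
    total_length = breakpoints[-1] - breakpoints[0]
    estimate = 0.0
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        segment_mean = mean_cost_midpoint(cost, left, right, n_per_segment)
        estimate += segment_mean * (right - left) / total_length
    return estimate


def make_step_cost(t_jump):
    """Hypothetical fault response: cost is 100 if the fault is injected
    before t_jump and 20 afterwards (a single discontinuity)."""
    def cost(t):
        return 100.0 if t < t_jump else 20.0
    return cost


def exact_step_mean(t_jump, t_end=10.0):
    """Closed-form expected cost of the step response on [0, t_end]."""
    return (100.0 * t_jump + 20.0 * (t_end - t_jump)) / t_end


if __name__ == "__main__":
    horizon = (0.0, 10.0)          # assumed fault injection interval
    known_breaks = [0.0, 3.3, 10.0]  # discontinuity located for the original design

    for t_jump in (3.3, 4.5):      # original design, then a design change
        cost = make_step_cost(t_jump)
        print("discontinuity at", t_jump)
        print("  exact      ", exact_step_mean(t_jump))
        print("  midpoint   ", mean_cost_midpoint(cost, *horizon, 5))
        print("  monte carlo", mean_cost_monte_carlo(cost, *horizon, 5))
        print("  piecewise  ", mean_cost_piecewise(cost, known_breaks))
```

In this toy setting, the five-point midpoint rule misses the exact expectation because none of its sample times resolves the jump, while the piecewise estimator recovers it with only two simulations. When the discontinuity moves to 4.5 but the breakpoints are left at 3.3, the piecewise estimate no longer matches the exact value, which mirrors the robustness concern raised in the abstract.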