Engineering resilience into safety-critical systems

1 Resilience and Safety

Resilience is often defined in terms of the ability to continue operations or recover a stable state after a major mishap or event. This definition focuses on the reactive nature of resilience and the ability to recover after an upset. In this chapter, we use a more general definition that includes prevention of upsets. In our conception, resilience is the ability of systems to prevent or adapt to changing conditions in order to maintain (control over) a system property. In this chapter, the property we are concerned about is safety or risk. To ensure safety, the system must be resilient in terms of avoiding failures and losses, as well as responding appropriately after the fact.

Major accidents are usually preceded by periods where the organization drifts toward states of increasing risk until the events occur that lead to a loss [12]. Our goal is to determine how to design resilient systems that respond to the pressures and influences causing the drift to states of higher risk or, if that is not possible, to design continuous risk management systems to detect the drift and assist in formulating appropriate responses before the loss event occurs.

Our approach rests on modeling and analyzing socio-technical systems and using the information gained in three ways: in designing the socio-technical system, in evaluating both planned responses to events and suggested organizational policies to prevent adverse organizational drift, and in defining appropriate metrics to detect changes in risk (the equivalent of a “canary in the coal mine”). To be useful, such modeling and analysis must be able to handle complex, tightly coupled systems with distributed human and automated control, advanced technology and software-intensive systems, and the organizational and social aspects of systems. To do this, we use a new model of accident causation (STAMP) based on system theory.
STAMP includes non-linear, indirect, and feedback relationships and can better handle the levels of complexity and technological innovation in today’s systems than traditional causality and accident models. In the next section, we briefly describe STAMP. Then we show how STAMP models can be used to design and analyze resilience by applying the approach to the safety culture of the NASA Space Shuttle program.

∗The research described in this chapter was partially supported by a grant from the NASA/USRA Center for Program/Project Management Research.