1 Resilience and Safety

Resilience is often defined in terms of the ability to continue operations or recover a stable state after a major mishap or event. This definition focuses on the reactive nature of resilience and the ability to recover after an upset. In this chapter, we use a more general definition that includes prevention of upsets. In our conception, resilience is the ability of systems to prevent or adapt to changing conditions in order to maintain (control over) a system property. In this chapter, the property we are concerned about is safety or risk. To ensure safety, the system must be resilient in terms of avoiding failures and losses, as well as responding appropriately after the fact.

Major accidents are usually preceded by periods where the organization drifts toward states of increasing risk until the events occur that lead to a loss [12]. Our goal is to determine how to design resilient systems that respond to the pressures and influences causing the drift to states of higher risk or, if that is not possible, to design continuous risk management systems to detect the drift and assist in formulating appropriate responses before the loss event occurs.

Our approach rests on modeling and analyzing socio-technical systems and using the information gained in designing the socio-technical system, in evaluating both planned responses to events and suggested organizational policies to prevent adverse organizational drift, and in defining appropriate metrics to detect changes in risk (the equivalent of a “canary in the coal mine”). To be useful, such modeling and analysis must be able to handle complex, tightly coupled systems with distributed human and automated control, advanced technology and software-intensive systems, and the organizational and social aspects of systems. To do this, we use a new model of accident causation (STAMP) based on system theory. STAMP includes non-linear, indirect, and feedback relationships and can better handle the levels of complexity and technological innovation in today’s systems than traditional causality and accident models.

In the next section, we briefly describe STAMP. Then we show how STAMP models can be used to design and analyze resilience by applying them to the safety culture of the NASA Space Shuttle program.

∗The research described in this chapter was partially supported by a grant from the NASA/USRA Center for Program/Project Management Research.
[1] Jacques Leplat et al. Occupational accident research and systems approach, 1984.
[2] Nancy G. Leveson et al. A new accident model for engineering safer systems, 2004.
[3] Howard E. McCurdy et al. Inside NASA: High Technology and Organizational Change in the U.S. Space Program, 1993.
[4] John D. Sterman et al. Business dynamics: systems thinking and modelling for a complex world, 2002.
[5] John D. Sterman et al. System Dynamics: Systems Thinking and Modeling for a Complex World, 2002.
[6] J. Swanson et al. Business Dynamics: Systems Thinking and Modeling for a Complex World, J. Oper. Res. Soc., 2002.
[7] Gustavo Stubrich. The Fifth Discipline: The Art and Practice of the Learning Organization, 1993.
[8] John Krige. Inside NASA: High Technology and Organizational Change in the U.S. Space Program, by H.E. McCurdy, 1997.
[9] Nancy G. Leveson et al. Applying STAMP in Accident Analysis, 2003.
[10] P. Senge. The fifth discipline: the art and practice of the learning organization, 1991.
[11] I. Svedung et al. Proactive Risk Management in a Dynamic Society, 2000.
[12] Michael Carr et al. Mars Program Independent Assessment Team Report, 2000.
[13] Joseph H. Saleh et al. Archetypes for organizational safety, 2006.