A Conceptual Framework for System Fault Tolerance

Abstract : A major problem in transitioning fault tolerance practices to the practitioner community is a lack of a common view of what fault tolerance is, and how it can help in the design of reliable computer systems. This document takes a step towards making fault tolerance more understandable by proposing a conceptual framework. The framework provides a consistent vocabulary for fault tolerance concepts, discusses how systems fail, describes commonly used mechanisms for making systems fault tolerant, and provides some rules for developing fault tolerant systems.