Fault Tolerance, Reliability and Testability for Distributed Systems.

Abstract: A growing need exists for improved fault tolerance, reliability, and testability in distributed systems that support Command, Control, Communications, and Intelligence (C3I) activities. The objective of this study is to provide a foundation for developing design measures and guidelines for fault tolerant systems. Taxonomies of fault tolerance and distributed systems are developed, and typical Air Force C3I needs in both fault tolerant and distributed computer systems are characterized. Reliability and availability experience for ten typical computer systems is reported in a consistent format, and the data are analyzed from the perspective of a distributed system user. Previous work on identifying problems in distributed systems and on design methods for solving them is discussed. Key issues in the design of fault tolerant distributed systems are identified. Fault location techniques for specific computer configurations found in C3I applications are described in detail. The study is a continuing effort, and a comprehensive design methodology will be developed based on the material presented in this report. (Author)