A Generic Fault-Tolerant Architecture for Real-Time Dependable Systems

The design of computer systems to be embedded in critical real-time applications is a complex task. Such systems must not only guarantee to meet the hard real-time deadlines imposed by their physical environment; they must guarantee to do so dependably, despite both physical faults (in hardware) and design faults (in hardware or software). A fault-tolerance approach is mandatory for these guarantees to be commensurate with the safety and reliability requirements of many life- and mission-critical applications.
A Generic Fault-Tolerant Architecture for Real-Time Dependable Systems explains the motivations and the results of a collaborative project(*) whose objective was to significantly decrease the lifecycle costs of such fault-tolerant systems. The end-user companies participating in this project currently deploy fault-tolerant systems in critical railway, space and nuclear-propulsion applications. However, these are proprietary systems whose architectures have been tailored to meet domain-specific requirements. This has led to very costly, inflexible, and often hardware-intensive solutions that, by the time they are developed, validated and certified for use in the field, can already be out of date in terms of their underlying hardware and software technology.
The project thus designed a generic fault-tolerant architecture with two dimensions of redundancy and a third, multi-level integrity dimension for accommodating software components of different levels of criticality. The architecture is largely based on commercial off-the-shelf (COTS) components and follows a software-implemented approach so as to minimise the need for special hardware. Using an associated development and validation environment, system developers may configure and validate instances of the architecture that can be shown to meet the very diverse requirements of railway, space, nuclear-propulsion and other critical real-time applications.
This book describes the rationale of the generic architecture, the design and validation of its communication, scheduling and fault-tolerance components, and the tools that make up its design and validation environment. The book concludes with a description of three prototype systems that have been developed following the proposed approach.
(*) Esprit project No. 20716: GUARDS: a Generic Upgradable Architecture for Real-time Dependable Systems.
