Error recovery in critical infrastructure systems

Critical infrastructure applications provide services on which society depends heavily, so they must survive faults that might cause a loss of service. These applications in turn depend on distributed information systems for all aspects of their operation, which makes the survivability of those information systems an important issue. Fault tolerance is a key mechanism for achieving survivability in these information systems. Much of the literature on fault-tolerant distributed systems focuses on local error recovery that masks the effects of faults. We describe a direction for error recovery in the face of catastrophic faults, whose effects cannot be masked using the available resources. The goal is to provide continued service, either an alternate service or a degraded one, by reconfiguring the system rather than masking faults. We outline the requirements for a reconfigurable system architecture and present an error recovery system that enables systematic structuring of error recovery specifications and implementations.
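
As a rough illustration of recovery by reconfiguration rather than masking, the following minimal sketch picks the most preferred configuration that can still run after a set of nodes has failed. It is not the error recovery system described in the paper: the names `ServiceLevel`, `Configuration`, `RECOVERY_SPEC`, and `select_configuration`, as well as the node and configuration names, are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceLevel(Enum):
    FULL = "full"
    ALTERNATE = "alternate"
    DEGRADED = "degraded"


@dataclass(frozen=True)
class Configuration:
    name: str
    level: ServiceLevel
    required_nodes: frozenset  # nodes that must be alive for this configuration


# Hypothetical recovery specification: configurations in order of preference.
RECOVERY_SPEC = [
    Configuration("normal", ServiceLevel.FULL, frozenset({"a", "b", "c"})),
    Configuration("backup-site", ServiceLevel.ALTERNATE, frozenset({"b", "c"})),
    Configuration("essential-only", ServiceLevel.DEGRADED, frozenset({"c"})),
]


def select_configuration(failed_nodes, spec=RECOVERY_SPEC):
    """Return the most preferred configuration that uses no failed node.

    Returns None when no configuration in the specification can run.
    """
    failed = set(failed_nodes)
    for config in spec:
        if config.required_nodes.isdisjoint(failed):
            return config
    return None
```

Under these assumptions, losing node "a" cannot be masked, so `select_configuration(["a"])` falls back to the alternate "backup-site" configuration, and losing both "a" and "b" leaves only the degraded "essential-only" service. Keeping the specification as data, separate from the selection logic, is one simple way to keep recovery specifications and their implementation apart.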
