Scheduling Fault Recovery Operations for Time-Critical Applications

This paper introduces algorithms for scheduling fault recovery operations on systems that must preserve the timing correctness of critical application tasks in the presence of faults. The algorithms reserve time for processing recovery tasks at the design stage, so that recovery tasks can be scheduled with very low run-time overhead, complementing or reducing the need for hardware replication to support dependable operation. Although previous work has advocated reservation methods, no formal methodology exists for allocating the reserved time. This paper develops a methodology that eases the difficult task of verifying the timing correctness of a desired reservation strategy. In addition, simulation results are presented that give insight into how effectively different reservation strategies avert timing failures under a variety of transient recovery loads.
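As a rough illustration of what verifying a reservation at design time can involve, the sketch below treats the reserved recovery time as one extra periodic task and checks the whole task set with standard fixed-priority response-time analysis. This analysis is swapped in for illustration only and is not the methodology of the paper. The names `response_time` and `schedulable_with_reservation`, the `(C, T)` task model with deadlines equal to periods, the rate-monotonic priority order, and the example task values are all assumptions.

```python
def response_time(index, tasks):
    """Worst-case response time of tasks[index], or None if it misses its deadline.

    tasks is ordered highest priority first; each entry is (C, T) with
    integer worst-case execution time C and period T (deadline assumed = T).
    """
    c, t = tasks[index]
    r = c
    while True:
        # Interference from every higher-priority task released in [0, r).
        # -(-r // tj) is an integer ceiling of r / tj.
        interference = sum(-(-r // tj) * cj for cj, tj in tasks[:index])
        nxt = c + interference
        if nxt > t:
            return None          # deadline (= period) would be missed
        if nxt == r:
            return r             # fixed point reached
        r = nxt


def schedulable_with_reservation(app_tasks, reserve):
    """Check application tasks plus a reserved recovery slot.

    reserve is a hypothetical (C, T) pair: C units of recovery time are
    set aside every T units. Priorities are assigned by period
    (shorter period = higher priority), which is an assumption here.
    """
    tasks = sorted([reserve] + list(app_tasks), key=lambda ct: ct[1])
    return all(response_time(i, tasks) is not None for i in range(len(tasks)))


# Hypothetical application task set: (execution time, period).
app = [(1, 4), (2, 6)]
small_reserve_ok = schedulable_with_reservation(app, (1, 5))
large_reserve_ok = schedulable_with_reservation(app, (3, 5))
```

With these illustrative numbers, reserving 1 time unit every 5 leaves every task able to meet its deadline, while reserving 3 units every 5 causes the lowest-priority application task to miss its deadline. A designer could use a check of this kind to compare candidate reservation sizes before deployment.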
