Synthesis of fault-tolerant embedded systems with checkpointing and replication

We present an approach to the synthesis of fault-tolerant hard real-time systems for safety-critical applications. Transient faults are tolerated using checkpointing with rollback recovery and active replication. Processes are statically scheduled, and communications are performed using the time-triggered protocol. Our synthesis approach decides the assignment of fault-tolerance policies to processes, the optimal placement of checkpoints, and the mapping of processes to processors, such that transient faults are tolerated and the timing constraints of the application are satisfied. We present several synthesis algorithms that find fault-tolerant implementations given a limited amount of resources. The algorithms are evaluated in extensive experiments, including a real-life example.
