Hardware/software optimization of error detection implementation for real-time embedded systems

This paper presents an approach to system-level optimization of error detection implementation for fault-tolerant, real-time distributed embedded systems used in safety-critical applications. An application is modeled as a set of processes that communicate by messages. Processes are mapped onto computation nodes connected to a communication infrastructure. To provide resilience against transient faults, efficient error detection and recovery techniques have to be employed. Our main focus in this paper is the efficient implementation of the error detection mechanisms. We have developed techniques to optimize the hardware/software implementation of error detection so as to minimize the global worst-case schedule length while meeting the imposed hardware cost constraints and tolerating multiple transient faults. We present two design optimization algorithms that find feasible solutions given a limited amount of resources: the first assumes that, when implemented in hardware, error detection is deployed on statically reconfigurable FPGAs, while the second considers the partial dynamic reconfiguration capabilities of the FPGAs.
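
To make the optimization problem concrete, the following is a minimal sketch of the trade-off described above. It is not either of the paper's two algorithms. It assumes a single computation node that runs its processes sequentially, a hypothetical set of error detection options per process (each with a worst-case execution time and an FPGA area cost), and a simplified fault model in which tolerating k transient faults adds k re-executions of the longest process to the schedule. All names (`EDOption`, `schedule_length`, `best_assignment`) and all numbers are illustrative assumptions.

```python
from itertools import product
from typing import NamedTuple, Optional, Sequence, Tuple


class EDOption(NamedTuple):
    """One hypothetical way to implement error detection for a process."""
    name: str     # e.g. "SW" (software only) or "HW" (detection moved to FPGA)
    wcet: int     # worst-case execution time, detection overhead included
    hw_cost: int  # FPGA area units consumed by this option


def schedule_length(wcets: Sequence[int], k: int) -> int:
    """Worst-case length under the simplifying assumptions stated above:
    sequential execution plus k re-executions of the longest process."""
    return sum(wcets) + k * max(wcets)


def best_assignment(
    options_per_process: Sequence[Sequence[EDOption]],
    hw_budget: int,
    k: int,
) -> Optional[Tuple[int, int, Tuple[str, ...]]]:
    """Exhaustively try every combination of options and keep the one with
    the shortest worst-case schedule that fits the hardware budget.
    Returns (length, hw_cost, option names) or None if nothing fits."""
    best = None
    for choice in product(*options_per_process):
        cost = sum(o.hw_cost for o in choice)
        if cost > hw_budget:
            continue  # violates the hardware cost constraint
        length = schedule_length([o.wcet for o in choice], k)
        if best is None or length < best[0]:
            best = (length, cost, tuple(o.name for o in choice))
    return best


# Illustrative data: two processes, each with a software and a hardware option.
OPTIONS = [
    [EDOption("SW", 100, 0), EDOption("HW", 60, 20)],
    [EDOption("SW", 80, 0), EDOption("HW", 50, 15)],
]

if __name__ == "__main__":
    # With 20 area units and one fault to tolerate, only one process can be
    # accelerated; moving the first one to hardware gives the shortest schedule.
    print(best_assignment(OPTIONS, hw_budget=20, k=1))  # (220, 20, ('HW', 'SW'))
```

Exhaustive search is used here only because the example is tiny; the number of combinations grows exponentially with the number of processes, which is why a realistic instance of this problem calls for heuristic optimization.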
