A compiler-based infrastructure for fault-tolerant co-design

The protection of processor-based systems to mitigate the harmful effects of transient faults (hardening) is gaining importance as technology shrinks. Hybrid hardware/software hardening approaches are promising alternatives in the design of such fault tolerant systems. This paper presents a compiler-based infrastructure for facilitating the exploration of the design space between hardware-only and software-only fault tolerant techniques. The compiler design is based on a generic architecture that facilitates the implementation of software-based techniques, providing an uniform isolated-from-target hardening core. In this way, these methods can be implemented in an architecture independent way and can easily integrate new protection mechanisms to automatically produce hardened code. The infrastructure includes a simulator that provides information about memory and execution time overheads to aid the designer in the co-design decisions. The tool-chain is complemented by a hardware fault emulation tool that allows to measure the fault coverage of the different solutions running on the real system. A case study was implemented allowing to evaluate the flexibility of the infrastructure to fit the reliability requirements of the system within their memory and performance restrictions.

[1]  J.N. Tombs,et al.  Radiation Environment Emulation for VLSI Designs: A Low Cost Platform based on Xilinx FPGA's , 2007, 2007 IEEE International Symposium on Industrial Electronics.

[2]  David I. August,et al.  Automatic Instruction-Level Software-Only Recovery , 2006, IEEE Micro.

[3]  R.C. Baumann,et al.  Radiation-induced soft errors in advanced semiconductor technologies , 2005, IEEE Transactions on Device and Materials Reliability.

[4]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[5]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[6]  J. N. Tombs,et al.  Noninvasive Fault Classification, Robustness and Recovery Time Measurement in Microprocessor-Type Architectures Subjected to Radiation-Induced Errors , 2009, IEEE Transactions on Instrumentation and Measurement.

[7]  Massimo Violante,et al.  A New Approach to Software-Implemented Fault Tolerance , 2004, J. Electron. Test..

[8]  David I. August,et al.  Software-controlled fault tolerance , 2005, TACO.

[9]  Hardening development environment for embedded systems , 2010 .

[10]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[11]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[12]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[13]  J. Eccles,et al.  International electrotechnical commission , 1955, Journal of the American Institute of Electrical Engineers.

[14]  S. Katkoori,et al.  Selective triple Modular redundancy (STMR) based single-event upset (SEU) tolerant synthesis for FPGAs , 2004, IEEE Transactions on Nuclear Science.

[15]  Edward J. McCluskey,et al.  Control-flow checking by software signatures , 2002, IEEE Trans. Reliab..

[16]  Fabian Vargas,et al.  A new hybrid fault detection technique for systems-on-a-chip , 2006, IEEE Transactions on Computers.

[17]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[18]  Marco Torchiano,et al.  A source-to-source compiler for generating dependable software , 2001, Proceedings First IEEE International Workshop on Source Code Analysis and Manipulation.

[19]  David I. August,et al.  Design and evaluation of hybrid fault-detection systems , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[20]  Edward J. McCluskey,et al.  ED4I: Error Detection by Diverse Data and Duplicated Instructions , 2002, IEEE Trans. Computers.

[21]  Irith Pomeranz,et al.  Transient-fault recovery using simultaneous multithreading , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[22]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[23]  Massimo Violante,et al.  A new software-based technique for low-cost fault-tolerant application , 2003, Annual Reliability and Maintainability Symposium, 2003..

[24]  Edward J. McCluskey,et al.  Concurrent Error Detection Using Watchdog Processors - A Survey , 1988, IEEE Trans. Computers.

[25]  R. Edwards,et al.  Technical standard for atmospheric radiation single event effects, (SEE) on avionics electronics , 2004, 2004 IEEE Radiation Effects Data Workshop (IEEE Cat. No.04TH8774).