The upset-fault-observer: A concept for self-healing adaptive fault tolerance

Advancing integration toward atomic scales makes components highly defect-prone and unstable over their lifetime, which demands paradigm shifts in electronic systems design. FPGAs are particularly sensitive to cosmic and other kinds of radiation that produce single-event upsets (SEUs) in configuration and internal memories. Typical fault-tolerance (FT) techniques combine triple-modular-redundancy (TMR) schemes with run-time reconfiguration (RTR). However, even the most successful approaches disregard the low suitability of fine-grain redundancy in nano-scale design, the poor scalability and programmability of application-specific architectures, the low performance-to-consumption ratio of board-level designs, or the scarce optimization capability of rigid redundancy structures. In that context, we introduce an innovative solution that exploits the flexibility, reusability, and scalability of a modular RTR SoC approach and reuses existing RTR IP-cores to assemble different TMR schemes at run-time. Thus, the system can adaptively trigger an adequate self-healing strategy according to execution-environment metrics and user-defined goals. Specifically, the paper presents: (a) the upset-fault-observer (UFO), an innovative run-time self-test and recovery strategy that delivers FT on request over several function cores but avoids the scalability cost of permanent redundancy by running periodic reconfigurable TMR scan-cycles, (b) run-time reconfigurable TMR schemes and self-repair mechanisms, and (c) an adaptive software organization model to manage the proposed FT strategies.
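The scan-cycle idea in (a) can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the paper's implementation: the names `Core`, `majority_vote`, and `scan_cycle` are hypothetical, an SEU is modeled as a bit mask XORed onto a core's output, and "reconfiguration" is modeled as clearing that mask. It shows one observer pass that temporarily groups each function core with two replicas, votes on the outputs, and repairs a core that disagrees with the majority.

```python
from dataclasses import dataclass
from typing import Callable, List


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority, the classic TMR voter function."""
    return (a & b) | (a & c) | (b & c)


@dataclass
class Core:
    """Hypothetical model of a reconfigurable function core."""
    name: str
    function: Callable[[int], int]  # stands in for the core's configuration
    fault_mask: int = 0             # assumption: nonzero models an SEU

    def run(self, x: int) -> int:
        # A faulty core returns the correct result with some bits flipped.
        return self.function(x) ^ self.fault_mask

    def reconfigure(self) -> None:
        # Assumption: reloading the configuration clears the upset.
        self.fault_mask = 0


def scan_cycle(cores: List[Core], test_input: int) -> List[str]:
    """Visit each core once, assemble a temporary TMR group, repair on mismatch.

    Returns the names of the cores that were repaired in this pass.
    """
    repaired = []
    for core in cores:
        # Two spare replicas are configured with the same function, so
        # redundancy is paid for one core at a time, not for all cores at once.
        replica_1 = Core(core.name + "-r1", core.function)
        replica_2 = Core(core.name + "-r2", core.function)
        observed = core.run(test_input)
        voted = majority_vote(observed,
                              replica_1.run(test_input),
                              replica_2.run(test_input))
        if observed != voted:
            core.reconfigure()
            repaired.append(core.name)
    return repaired


if __name__ == "__main__":
    cores = [
        Core("adder", lambda x: x + 1),
        Core("doubler", lambda x: x * 2, fault_mask=0b100),  # injected upset
    ]
    print(scan_cycle(cores, 5))  # names of cores repaired in this pass
```

In this model the replicas are assumed fault-free, so a single mismatch always identifies the observed core as the faulty one; a real system would also have to handle upsets in the replicas and in the voter itself.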
