FMER: A hybrid configuration memory error recovery scheme for highly reliable FPGA SoCs

High-reliability SRAM-based Field Programmable Gate Array (FPGA) designs deployed in space are commonly triplicated to mask Single Event Upsets (SEUs) and use either scrubbing or modular reconfiguration to recover from radiation-induced configuration memory errors. Scrubbing benefits from vendor support and clears errors anywhere in the design, but it suffers from longer recovery times and higher energy use. Module-based error recovery is more energy efficient and more responsive, but it repairs only corrupted Triple Modular Redundancy (TMR) modules; the supporting parts of the design that lie outside the modules, such as pins or routing, are left unrecovered. This paper proposes and assesses a hybrid technique, Frame- and Module-based Error Recovery (FMER), that uses modular reconfiguration to repair faulty TMR modules and scrubbing to repair the supporting parts of the design. We derive and compare the reliability, availability and power consumption of TMR-based System on Chip (SoC) designs that incorporate FMER, modular reconfiguration alone, blind scrubbing, or no recovery. Our results show that FMER has the highest reliability and availability of the studied techniques in high-radiation environments or when a mission's energy budget is limited.
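The hybrid recovery policy described above can be pictured as a simple dispatch on where a configuration memory error is located. The following is a minimal sketch only: the names `Region`, `Action` and `choose_recovery` are hypothetical, and it assumes the error has already been detected and localized, which the abstract does not describe.

```python
from enum import Enum


class Region(Enum):
    """Where a configuration memory error was found (hypothetical labels)."""
    TMR_MODULE = "tmr_module"  # inside a triplicated, reconfigurable module
    SUPPORT = "support"        # pins, routing, other logic outside the modules


class Action(Enum):
    """Recovery actions combined by the hybrid scheme (hypothetical labels)."""
    RECONFIGURE_MODULE = "reconfigure_module"
    SCRUB_FRAMES = "scrub_frames"


def choose_recovery(region: Region) -> Action:
    """Pick a recovery action for an already-localized error.

    Assumption: faulty TMR modules are repaired by modular reconfiguration,
    and everything outside the modules is repaired by scrubbing.
    """
    if region is Region.TMR_MODULE:
        return Action.RECONFIGURE_MODULE
    return Action.SCRUB_FRAMES


if __name__ == "__main__":
    for region in Region:
        print(region.name, "->", choose_recovery(region).name)
```

The sketch captures only the division of labor between the two mechanisms; it says nothing about detection latency, scrub scheduling or energy cost, which are the quantities the paper's analysis compares.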
