Resource budgeting for reliability in reconfigurable architectures

SRAM-based reconfigurable architectures are susceptible to soft errors. Accelerators in the reconfigurable fabric therefore need to be protected by fault tolerance techniques such as modular redundancy and scrubbing. However, blindly applying these techniques to all accelerators leads to suboptimal performance due to overprotection. We introduce a metric, effective critical bits, that captures the reliability-impacting factors of an application. We present a method for budgeting effective critical bits, i.e., decomposing the number of effective critical bits allowed for an application into the numbers allowed for its computational kernels, and then into those allowed for the kernels' accelerated functions. This budgeting enables the runtime system to select appropriate accelerators and fault tolerance techniques that maximize performance under a given reliability target. Compared to a strategy that duplicates all accelerators in the system, our method achieves up to 85% higher performance across a variety of reliability targets and soft-error rates.
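The abstract does not specify how effective critical bits (ECB) are computed or how the budget is divided, so the following Python sketch only illustrates the general shape of two-level budgeting followed by runtime selection, under explicit assumptions: ECB is treated as additive, each parent budget is split among its children by hypothetical weights, and for each accelerated function the runtime picks the fastest variant whose ECB fits that function's share. The names `Variant`, `split_budget`, `select_variant`, and `plan`, the variant labels, and all numbers are hypothetical and are not the paper's method or data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical accelerator variant: one implementation plus one fault tolerance technique."""
    name: str       # illustrative label, e.g. "none", "DMR", "TMR"
    ecb: float      # effective critical bits of this variant (assumed precomputed)
    latency: float  # time per invocation, arbitrary units (assumed precomputed)


def split_budget(total: float, weights: dict[str, float]) -> dict[str, float]:
    """Split a parent ECB budget among children in proportion to assumed weights.

    Assumes ECB is additive, so the children's budgets sum to the parent's budget.
    """
    weight_sum = sum(weights.values())
    return {child: total * w / weight_sum for child, w in weights.items()}


def select_variant(options: list[Variant], budget: float) -> Variant | None:
    """Return the fastest variant whose ECB fits within the budget, or None if none fits."""
    feasible = [v for v in options if v.ecb <= budget]
    return min(feasible, key=lambda v: v.latency) if feasible else None


def plan(app_budget: float,
         kernel_weights: dict[str, float],
         function_weights: dict[str, dict[str, float]],
         variants: dict[str, list[Variant]]) -> dict[str, Variant | None]:
    """Budget application -> kernels -> functions, then select one variant per function."""
    selection: dict[str, Variant | None] = {}
    for kernel, kernel_budget in split_budget(app_budget, kernel_weights).items():
        fn_budgets = split_budget(kernel_budget, function_weights[kernel])
        for fn, fn_budget in fn_budgets.items():
            selection[fn] = select_variant(variants[fn], fn_budget)
    return selection


# Illustrative numbers only; they are not taken from the paper.
VARIANTS = {
    "f1": [Variant("none", 500, 1.0), Variant("DMR", 250, 1.3), Variant("TMR", 50, 1.6)],
    "f2": [Variant("none", 200, 2.0), Variant("TMR", 20, 2.5)],
    "g": [Variant("none", 900, 1.0), Variant("DMR", 450, 1.2), Variant("TMR", 90, 1.5)],
}

selection = plan(
    app_budget=1000.0,
    kernel_weights={"A": 3, "B": 2},                          # kernel A gets 600, B gets 400
    function_weights={"A": {"f1": 1, "f2": 1}, "B": {"g": 1}},  # f1, f2 get 300 each; g gets 400
    variants=VARIANTS,
)
for fn, v in selection.items():
    print(fn, v.name if v else "infeasible")
```

With these made-up numbers the sketch chooses DMR for `f1`, the unprotected variant for `f2`, and TMR for `g`, so protection is applied only where a function's share of the budget requires it rather than uniformly. Proportional splitting is just one assumed policy; a real budgeter could instead optimize the split itself.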
