Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery

Abstract: All-Programmable System-on-Chips (APSoCs) are a compelling option for deploying applications in radiation environments thanks to their high computing performance and power efficiency. Despite these advantages, APSoCs are sensitive to radiation like any other electronic device. Processors embedded in APSoCs therefore have to be adequately hardened against ionizing radiation to be a viable design choice for harsh environments. This paper proposes a novel lockstep-based approach to harden the dual-core ARM Cortex-A9 processor in the Xilinx Zynq-7000 APSoC against radiation-induced soft errors by coupling it with a MicroBlaze TMR subsystem in the programmable logic (PL) layer of the Zynq. The proposed technique combines checkpointing with roll-back and roll-forward mechanisms at the software level (i.e. software redundancy) and processor replication with checker circuits at the hardware level (i.e. hardware redundancy). Fault injection experiments show that the proposed approach mitigates around 98% of bit-flips injected into the register files of both ARM cores while keeping the timing performance overhead as low as 25% when block and application sizes are adjusted appropriately. Furthermore, adding the roll-forward recovery operation to the roll-back operation improves the Mean Workload Between Failures (MWBF) of the system by up to ≈19%, depending on the nature of the running application: when a fault occurs, the application proceeds faster under roll-forward than under roll-back, so relatively more data can be processed before the next error occurs in the system.
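The difference between the two recovery operations can be illustrated with a small software model. The sketch below is not the paper's implementation: the function `run_lockstep`, the single-bit fault model, and the assumption that a third, trusted execution of the block (standing in for the MicroBlaze TMR subsystem) acts as the tie-breaker during roll-forward are all hypothetical and chosen only for illustration. It counts block executions to show why roll-forward can let the application proceed faster after a fault.

```python
from typing import Callable, List, Optional, Tuple

Block = Callable[[int], int]  # one application block: state in, state out


def run_lockstep(blocks: List[Block], state: int,
                 fault: Optional[Tuple[int, int]] = None,
                 roll_forward: bool = True) -> Tuple[int, int]:
    """Toy lockstep model (hypothetical, for illustration only).

    fault = (block_index, core_id) injects one transient bit flip into that
    core's result. Returns (final_state, number_of_block_executions).
    """
    checkpoint = state          # last state on which both cores agreed
    executions = 0
    pending_fault = fault
    i = 0
    while i < len(blocks):
        results = []
        for core in (0, 1):     # both cores run the same block from the checkpoint
            out = blocks[i](checkpoint)
            executions += 1
            if pending_fault == (i, core):
                out ^= 1        # assumed fault model: flip the lowest bit once
                pending_fault = None
            results.append(out)

        if results[0] == results[1]:
            checkpoint = results[0]   # checker passes: commit a new checkpoint
            i += 1
        elif roll_forward:
            # Assumed roll-forward: one trusted reference run identifies the
            # correct core, whose result is committed so execution continues.
            reference = blocks[i](checkpoint)
            executions += 1
            if reference in results:
                checkpoint = reference
                i += 1
            # Otherwise no agreement: the loop re-executes the block.
        # Roll-back: checkpoint is unchanged, so the loop re-executes block i.
    return checkpoint, executions


blocks = [lambda s: s + 1, lambda s: s * 2, lambda s: s + 3]
clean = run_lockstep(blocks, 1)
rolled_back = run_lockstep(blocks, 1, fault=(1, 0), roll_forward=False)
rolled_forward = run_lockstep(blocks, 1, fault=(1, 0), roll_forward=True)
```

In this toy model both recovery modes reach the same final state as the fault-free run, but roll-back repeats the faulty block on both cores while roll-forward needs only one extra reference execution. The real overhead in the paper depends on block size, checkpoint cost, and the hardware checker, none of which are modeled here.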
