Towards a Heterogeneous Fault-Tolerance Architecture based on Arm and RISC-V Processors

Computer systems are pervasive in our daily lives, across a wide range of applications. In systems with mixed-criticality requirements, e.g., autonomous driving or aerospace applications, devices are expected to continue operating properly even in the event of a failure. One approach to improving the robustness of a device's operation is to build fault-tolerance mechanisms into the system's design. This article proposes Lock-V, a heterogeneous architecture that explores a Dual-Core Lockstep (DCLS) fault-tolerance technique across two different processing units: a hard-core Arm Cortex-A9 and a soft-core RISC-V-based processor. It resorts to a System-on-Chip (SoC) solution that combines software programmability (available through the hard-core Arm Cortex-A9) with field-programmable gate array (FPGA) technology, taking advantage of the latter to deploy the RISC-V soft-core along with dedicated hardware accelerators towards the realization of the DCLS.
