Analyzing the Impact of Fault-Tolerance Methods in ARM Processors Under Soft Errors Running Linux and Parallelization APIs

This paper analyzes the efficiency of traditional fault-tolerance methods on parallel systems running on top of the Linux operating system. It starts by studying the occurrence of soft errors in systems of increasing complexity, from sequential bare-metal applications to parallel Linux applications. Two traditional fault-tolerance mechanisms, triple modular redundancy (TMR) and a variant of duplication with comparison (DWC), are then applied to the applications and their efficiency is analyzed. All cases were tested on the single-core and dual-core versions of an ARM Cortex-A9 processor, which is embedded in many commercial systems-on-a-chip. The OVP simulator platform is used to instantiate the processor model and to inject faults into the system. Faults are modeled as bit flips in the processor registers. Results show that traditional fault-tolerance algorithms are not efficient enough to protect a whole parallel system running on top of an operating system, given that the operating system itself is a major source of errors.
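The fault model and the two mechanisms named above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the paper injects bit flips into simulated processor registers through OVP, whereas this sketch flips a bit in a 32-bit result value. The names `flip_bit`, `tmr_vote`, `dwc_check`, and `compute` are hypothetical and introduced only for illustration.

```python
import random

WORD_BITS = 32                 # assumed register width for this sketch
MASK = (1 << WORD_BITS) - 1


def flip_bit(value, bit):
    """Return value with one bit inverted, modelling a single bit flip."""
    if not 0 <= bit < WORD_BITS:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & MASK


def tmr_vote(a, b, c):
    """Bitwise majority vote over three redundant results (TMR)."""
    return (a & b) | (a & c) | (b & c)


def dwc_check(a, b):
    """Duplication with comparison: True when both copies agree.

    A mismatch is only detected, not corrected.
    """
    return a == b


def compute(x):
    """Hypothetical workload standing in for the protected application."""
    return (x * 2654435761) & MASK


# One fault-free reference run and one run corrupted by a single bit flip.
golden = compute(12345)
rng = random.Random(0)         # fixed seed so the sketch is repeatable
faulty = flip_bit(golden, rng.randrange(WORD_BITS))

voted = tmr_vote(golden, faulty, golden)     # two good copies outvote the bad one
detected = not dwc_check(golden, faulty)     # DWC flags the mismatch

print(voted == golden, detected)
```

Both mechanisms in this sketch cover only the replicated computation. The voter, the comparator, and any operating system code running underneath are left unprotected, which is consistent with the paper's observation that the operating system remains a major source of errors.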
