Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems

This paper presents a novel approach to the design of multi-/many-core systems with an adaptive level of reliability. The approach defines a layer at the operating system level that provides fault detection, tolerance, and diagnosis by means of thread replication and re-execution mechanisms. The layer adapts at run-time to changes in the working scenario, applying the hardening mechanism best suited to achieving the desired trade-off between reliability and performance. The proposed strategy has been evaluated in a set of experimental sessions on a real-world parallel application, comparing its benefits on the final system against those of various strategies selected at design time.
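To make the mechanisms named above concrete, the following is a minimal sketch of thread replication with comparison, re-execution on mismatch, and a run-time choice between two hardening mechanisms. It is not the paper's implementation: it runs in user-space Python rather than in an operating-system layer, and the function names, the fault-rate threshold, and the selection policy are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_replicas(task, arg, n):
    """Run n replicas of task(arg) in separate threads and collect the results."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(task, arg) for _ in range(n)]
        return [f.result() for f in futures]


def duplicate_and_reexecute(task, arg, max_retries=2):
    """Detect a fault by comparing two replicas; recover by re-executing both."""
    for _ in range(max_retries + 1):
        first, second = run_replicas(task, arg, 2)
        if first == second:
            return first
    raise RuntimeError("replicas disagreed on every attempt")


def triplicate_and_vote(task, arg):
    """Mask a single faulty replica by majority voting over three results.

    Results must be hashable so that they can be counted.
    """
    results = run_replicas(task, arg, 3)
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among replicas")
    return value


def choose_mechanism(observed_fault_rate, threshold=0.01):
    """Hypothetical policy: pay for triplication only when faults are frequent."""
    if observed_fault_rate > threshold:
        return triplicate_and_vote
    return duplicate_and_reexecute
```

In this sketch, duplication costs two executions in the fault-free case and more only when a mismatch is detected, while triplication always costs three but masks a single fault without re-execution. That difference is the kind of reliability/performance trade-off an adaptive layer could weigh at run-time.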
