A co-design approach for fault-tolerant loop execution on Coarse-Grained Reconfigurable Arrays

We present a co-design approach for applying redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) to a whole region of a processor array for a class of Coarse-Grained Reconfigurable Arrays (CGRAs). The approach targets applications with mixed-criticality properties that experience varying Soft Error Rates (SERs) for environmental reasons, e.g., changing altitude. The core idea is to adapt the degree of fault protection for loop programs executing in parallel on a CGRA to the required level of reliability as well as to the SER profile. This is realized by claiming neighboring regions of processing elements for the execution of replicated loop nests. First, at the source code level, we propose a compiler transformation that realizes these replication schemes in two steps: (1) replicate the given parallel loop program two or three times for DMR or TMR, respectively, and (2) add appropriate error handling functions (comparison or voting) in order to detect or correct, respectively, any single error. Then, exploiting the opportunities of hardware/software co-design, we propose optimized implementations of the error handling functions in software as well as in hardware. Finally, we present experimental results that analyze the reliability gains of each proposed array replication scheme as a function of the SER.
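
The two-step transformation described above (replicate the loop, then add comparison or voting) can be illustrated with a minimal Python sketch. It is not the compiler transformation or the CGRA mapping proposed in the paper: the example loop `fir_loop`, the helpers `run_dmr`, `run_tmr`, and `tmr_vote`, and the fault-injection hook `flip_first_bit` are all hypothetical names chosen for illustration, and the replicas run sequentially here rather than on neighboring regions of processing elements.

```python
def fir_loop(x, coeffs):
    """Example loop nest (assumed for illustration): a small FIR filter."""
    taps = len(coeffs)
    y = []
    for i in range(len(x) - taps + 1):
        acc = 0
        for j in range(taps):
            acc += coeffs[j] * x[i + j]
        y.append(acc)
    return y


def flip_first_bit(y):
    """Hypothetical fault model: flip the lowest bit of the first output."""
    faulty = list(y)
    faulty[0] ^= 1
    return faulty


def run_dmr(loop, args, fault=None):
    """DMR: run two replicas and compare; a mismatch only detects an error."""
    r0 = loop(*args)
    r1 = loop(*args)
    if fault is not None:
        r1 = fault(r1)  # corrupt one replica to emulate a soft error
    error_detected = any(a != b for a, b in zip(r0, r1))
    return r0, error_detected


def tmr_vote(a, b, c):
    """Majority vote over three values; masks any single erroneous value."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one replica is faulty")


def run_tmr(loop, args, fault=None):
    """TMR: run three replicas and vote element-wise on their outputs."""
    results = [loop(*args) for _ in range(3)]
    if fault is not None:
        results[1] = fault(results[1])  # corrupt exactly one replica
    return [tmr_vote(a, b, c) for a, b, c in zip(*results)]


if __name__ == "__main__":
    args = ([1, 2, 3, 4], [1, 1])
    print(run_dmr(fir_loop, args, fault=flip_first_bit))
    print(run_tmr(fir_loop, args, fault=flip_first_bit))
```

In this sketch, DMR reports the mismatch but cannot tell which replica is wrong, whereas TMR masks the single corrupted output through the vote. This mirrors the distinction between detection and correction drawn in the abstract.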
