Techniques for on-demand structural redundancy for massively parallel processor arrays

Abstract: In this paper, we present techniques for providing on-demand structural redundancy for Coarse-Grained Reconfigurable Arrays (CGRAs) and a calculus for determining the reliability gains obtained when applying these replication techniques, from the perspective of safety-critical parallel loop program applications. Here, to protect massively parallel loop computations against errors such as soft errors, well-known replication schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) must be applied to each single Processor Element (PE) rather than to only one, based on application requirements for reliability and Soft Error Rates (SERs). Moreover, different voting options and signal replication schemes are investigated. It is shown that hardware voting can be accomplished at negligible hardware cost, i.e., less than two percent area overhead per PE, for a class of reconfigurable processor arrays called Tightly Coupled Processor Arrays (TCPAs). As a major contribution of this paper, a formal analysis is elaborated of the reliability achievable by each combination of replication and voting scheme for parallel loop executions on CGRAs, as a function of a given SER and the application timing characteristics (schedule). Using this analysis, error detection latencies can be computed, and proper decisions on which replication scheme to choose at runtime to guarantee a maximal probability of failure on demand can be derived. Finally, fault-simulation results are provided and compared with the formal analysis of reliability.
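To make the idea of a reliability gain from replication concrete, the following minimal sketch uses the classical textbook models for replicated units. It is not the calculus developed in the paper. It assumes a constant soft error rate per PE and cycle (exponential failure model), independent failures between replicas, and a perfect voter or comparator. All function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def pe_reliability(ser_per_cycle: float, cycles: int) -> float:
    """Probability that a single PE sees no soft error during `cycles` cycles.

    Assumption: constant error rate, i.e. R(t) = exp(-lambda * t).
    """
    return math.exp(-ser_per_cycle * cycles)


def dmr_error_free(r: float) -> float:
    """Probability that both DMR replicas are correct (no mismatch is flagged).

    DMR with a comparator can detect, but not correct, a single faulty replica.
    """
    return r * r


def tmr_reliability(r: float) -> float:
    """Probability that at least two of three replicas are correct.

    Classical majority-voting result: R_TMR = 3 R^2 - 2 R^3 (perfect voter).
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    # Hypothetical numbers: error rate per cycle and loop execution length.
    r = pe_reliability(ser_per_cycle=1e-9, cycles=1_000_000)
    print(f"single PE failure probability:  {1 - r:.3e}")
    print(f"DMR mismatch probability:       {1 - dmr_error_free(r):.3e}")
    print(f"TMR failure probability:        {1 - tmr_reliability(r):.3e}")
```

In this simplified model, TMR improves on a single PE only while the per-PE reliability stays above 0.5, which is why the execution time covered by one voting step (determined by the schedule) matters alongside the SER.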
