On-demand fault-tolerant loop processing on massively parallel processor arrays

We present a compilation-based technique for providing on-demand structural redundancy on massively parallel processor arrays. It lets application programmers trade throughput for reliability according to application requirements. To protect parallel loop computations against errors, we propose applying the well-known fault-tolerance schemes dual modular redundancy (DMR) and triple modular redundancy (TMR) to a whole region of the processor array rather than to individual processing elements. At the source-code level, the compiler realizes these replication schemes with a program transformation that (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations, whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error-resilience needs of the application or on the expected soft error rates. Finally, we explore the tradeoffs among these variants in terms of performance overhead and error detection latency.
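The replicate-and-vote idea can be illustrated with a minimal Python sketch. It is not the compiler transformation itself: the replicas run sequentially here instead of on separate regions of a processor array, and every name (`run_replicated`, `vote_every`, `fault`) is a hypothetical choice made for illustration. The abstract does not specify the three voting variants, so the `vote_every` parameter is only an assumed stand-in that shows how voting frequency affects detection latency.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority of replicas.

    Raises RuntimeError when no strict majority exists, which is what
    happens on any disagreement under DMR (detection without correction).
    """
    value, count = Counter(values).most_common(1)[0]
    if count * 2 <= len(values):
        raise RuntimeError("replicas disagree and no majority exists")
    return value


def run_replicated(body, init, n_iters, replicas=3, vote_every=1, fault=None):
    """Run a loop body on several replicated copies of the loop state.

    replicas   -- 2 for DMR, 3 for TMR
    vote_every -- assumed knob: vote after every k-th iteration
                  (a vote always happens after the last iteration)
    fault      -- optional (replica, iteration, corrupt_fn) used to
                  inject a single error for demonstration
    Returns (result, iteration at which a disagreement was first seen).
    """
    states = [init] * replicas
    detected_at = None
    for i in range(n_iters):
        # Each replica executes the same loop iteration independently.
        states = [body(s, i) for s in states]
        if fault is not None and fault[1] == i:
            states[fault[0]] = fault[2](states[fault[0]])
        if (i + 1) % vote_every == 0 or i == n_iters - 1:
            if detected_at is None and len(set(states)) > 1:
                detected_at = i
            # Voting masks the error (TMR) or raises (DMR).
            states = [majority(states)] * replicas
    return states[0], detected_at


body = lambda acc, i: acc + i            # toy loop: sum of 0..n-1
flip = (1, 2, lambda x: x + 100)         # corrupt replica 1 in iteration 2

print(run_replicated(body, 0, 8, vote_every=1, fault=flip))  # (28, 2)
print(run_replicated(body, 0, 8, vote_every=8, fault=flip))  # (28, 7)
```

In this toy setting, voting after every iteration detects the injected error in the iteration where it occurs, while voting only once at the end still yields the correct result under TMR but detects the error five iterations later. With `replicas=2`, the same fault raises an exception, because DMR can detect a disagreement but cannot decide which replica is correct.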
