On-Demand Fault-Tolerant Loop Processing

Since the feature sizes of silicon devices continue to shrink, it is imperative to counter the increasing proneness to errors of modern, complex systems by applying appropriate fault tolerance measures. In this chapter, we therefore propose new techniques that leverage the advantages of self-organizing computing paradigms such as invasive computing to implement fault tolerance on multiprocessor systems-on-chips (MPSoCs) adaptively. We presented new compile time transformations that introduce modular redundancy into a loop program to protect it against soft errors. Our approach uses the abundant number of processing elements (PEs) within a tightly coupled processor array (TCPA) to claim not only one region of a processor array, but instead two (dual modular redundancy (DMR)) or three (triple modular redundancy (TMR)) contiguous neighboring regions of PEs. At the source code level, the compiler realizes these replication schemes with a program transformation that: (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants regarding performance overheads and error detection latency.

[1]  Shubu Mukherjee,et al.  Architecture Design for Soft Errors , 2008 .

[2]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[3]  Scott Mahlke,et al.  Efficient soft error protection for commodity embedded microprocessors using profile information , 2012, LCTES 2012.

[4]  Ming Zhang,et al.  Combinational Logic Soft Error Correction , 2006, 2006 IEEE International Test Conference.

[5]  Niraj K. Jha,et al.  COFTA : Hardware-Software Co-Synthesis of Heterogeneous Distributed Embedded Systems for Low Overhead Fault Tolerance , 1999 .

[6]  Jürgen Teich,et al.  Massively Parallel Processor Architectures for Resource-aware Computing , 2014, ArXiv.

[7]  Petru Eles,et al.  Synthesis of Fault-Tolerant Embedded Systems , 2008, 2008 Design, Automation and Test in Europe.

[8]  Michael Glaß,et al.  Invasive computing for timing-predictable stream processing on MPSoCs , 2016, it Inf. Technol..

[9]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[10]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[11]  Karthikeyan Sankaralingam,et al.  Relax: an architectural framework for software recovery of hardware faults , 2010, ISCA.

[12]  Mahmut T. Kandemir,et al.  Compiler-assisted soft error detection under performance and energy constraints in embedded systems , 2009, TECS.

[13]  Michael Glaß,et al.  Language and Compilation of Parallel Programs for *-Predictable MPSoC Execution Using Invasive Computing , 2016, 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC).

[14]  Jürgen Teich,et al.  Reconfigurable Buffer Structures for Coarse-Grained Reconfigurable Arrays , 2015, IESS.

[15]  Shekhar Y. Borkar,et al.  Thousand Core ChipsA Technology Perspective , 2007, 2007 44th ACM/IEEE Design Automation Conference.

[16]  Kevin Skadron,et al.  Cost-effective safety and fault localization using distributed temporal redundancy , 2011, 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES).

[17]  Jürgen Teich,et al.  Techniques for on-demand structural redundancy for massively parallel processor arrays , 2015, J. Syst. Archit..

[18]  Jürgen Teich,et al.  Providing fault tolerance through invasive computing , 2016, it Inf. Technol..

[19]  B. Ramakrishna Rau,et al.  Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing , 1981, MICRO 14.

[20]  Donald Yeung,et al.  Application-Level Correctness and its Impact on Fault Tolerance , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[21]  Nagarajan Kandasamy,et al.  Transparent recovery from intermittent faults in time-triggered distributed systems , 2003 .

[22]  Martin C. Rinard,et al.  Automatically identifying critical input regions and code in applications , 2010, ISSTA '10.

[23]  Masanori Hashimoto,et al.  Coarse-grained dynamically reconfigurable architecture with flexible reliability , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[24]  Carl E. Landwehr,et al.  Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.

[25]  Cristiana Bolchini A software methodology for detecting hardware faults in VLIW data paths , 2003, IEEE Trans. Reliab..

[26]  Ravishankar K. Iyer,et al.  Recent advances and new avenues in hardware-level reliability support , 2005, IEEE Micro.

[27]  Jürgen Teich,et al.  On-demand fault-tolerant loop processing on massively parallel processor arrays , 2015, 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[28]  Jürgen Teich,et al.  Adaptive fault tolerance through invasive computing , 2015, 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS).

[29]  Rami G. Melhem,et al.  Loop Transformations for Fault Detection in Regular Loops on Massively Parallel Systems , 1996, IEEE Trans. Parallel Distributed Syst..

[30]  Babak Falsafi,et al.  Reunion: Complexity-Effective Multicore Redundancy , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[31]  Karthik Pattabiraman,et al.  Error detector placement for soft computation , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[32]  M. Shea,et al.  CREME96: A Revision of the Cosmic Ray Effects on Micro-Electronics Code , 1997 .

[33]  Victor P. Nelson Fault-tolerant computing: fundamental concepts , 1990, Computer.

[34]  Tommy Kuhn,et al.  Low-Cost TMR for Fault-Tolerance on Coarse-Grained Reconfigurable Architectures , 2011, 2011 International Conference on Reconfigurable Computing and FPGAs.

[35]  Quinn Jacobson,et al.  ERSA: error resilient system architecture for probabilistic applications , 2010, DATE 2010.

[36]  Alan Burns,et al.  Analysis of Checkpointing for Real-Time Systems , 2004, Real-Time Systems.

[37]  Jürgen Teich,et al.  A co-design approach for fault-tolerant loop execution on Coarse-Grained Reconfigurable Arrays , 2015, 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS).

[38]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[39]  Michael Nicolaidis Time redundancy based soft-error tolerance to rescue nanometer technologies , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).

[40]  V.B. Prasad,et al.  Fault tolerant digital systems , 1989, IEEE Potentials.

[41]  Heinz Gall Functional safety IEC 61508 / IEC 61511 the impact to certification and the user , 2008, 2008 IEEE/ACS International Conference on Computer Systems and Applications.

[42]  Jürgen Teich,et al.  System integration of tightly-coupled processor arrays using reconfigurable buffer structures , 2013, CF '13.