Providing fault tolerance through invasive computing

Abstract As a consequence of technology scaling, today's complex multi-processor systems have become more and more susceptible to errors. In order to satisfy reliability requirements, such systems require methods to detect and tolerate errors. This entails two major challenges: (a) providing a comprehensive approach that ensures fault-tolerant execution of parallel applications across different types of resources, and (b) optimizing resource usage in the face of dynamic fault probabilities or with varying fault tolerance needs of different applications. In this paper, we present a holistic and adaptive approach to provide fault tolerance on Multi-Processor System-on-a-Chip (MPSoC) on demand of an application or environmental needs based on invasive computing. We show how invasive computing may provide adaptive fault tolerance on a heterogeneous MPSoC including hardware accelerators and communication infrastructure such as a Network-on-Chip (NoC). In addition, we present (a) compile-time transformations to automatically adopt well-known redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) for fault-tolerant loop execution on a class of massively parallel arrays of processors called as Tightly Coupled Processor Arrays (). Based on timing characteristics derived from our compilation flow, we further develop (b) a reliability analysis guiding the selection of a suitable degree of fault tolerance. Finally, we present (c) a methodology to detect and adaptively mitigate faults in invasive NoCs.

[1]  Paolo Meloni,et al.  System Adaptivity and Fault-Tolerance in NoC-based MPSoCs: The MADNESS Project Approach , 2012, 2012 15th Euromicro Conference on Digital System Design.

[2]  Jürgen Teich,et al.  On-demand fault-tolerant loop processing on massively parallel processor arrays , 2015, 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[3]  Kevin Klues,et al.  Processes and Resource Management in a Scalable Many-core OS ∗ , 2010 .

[4]  Sarita V. Adve,et al.  Architectures for online error detection and recovery in multicore processors , 2011, 2011 Design, Automation & Test in Europe.

[5]  Michael Glaß,et al.  DAARM: Design-time application analysis and run-time mapping for predictable execution in many-core systems , 2014, 2014 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[6]  Lothar Thiele,et al.  On the design of piecewise regular processor arrays , 1989, IEEE International Symposium on Circuits and Systems,.

[7]  Jürgen Teich,et al.  CAP: Communication aware programming , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).

[8]  Vahid Lari Invasive Tightly Coupled Processor Arrays , 2016 .

[9]  Jürgen Teich,et al.  Fault-tolerant communication in invasive networks on chip , 2015, 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS).

[10]  Luca Benini,et al.  Analysis of error recovery schemes for networks on chips , 2005, IEEE Design & Test of Computers.

[11]  Jürgen Teich,et al.  The Invasive Network on Chip - A Multi-Objective Many-Core Communication Infrastructure , 2014, ARCS Workshops.

[12]  Ahmad Patooghy,et al.  XYX: A Power & Performance Efficient Fault-Tolerant Routing Algorithm for Network on Chip , 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.

[13]  Radu Marculescu,et al.  FARM: Fault-aware resource management in NoC-based multiprocessor platforms , 2011, 2011 Design, Automation & Test in Europe.

[14]  Tommy Kuhn,et al.  Low-Cost TMR for Fault-Tolerance on Coarse-Grained Reconfigurable Architectures , 2011, 2011 International Conference on Reconfigurable Computing and FPGAs.

[15]  Wenhai Li,et al.  A Self-Adaptive SEU Mitigation System for FPGAs with an Internal Block RAM Radiation Particle Sensor , 2013, FCCM 2013.

[16]  Jürgen Teich,et al.  Techniques for on-demand structural redundancy for massively parallel processor arrays , 2015, J. Syst. Archit..

[17]  Mario Porrmann,et al.  Self-optimization of MPSoCs Targeting Resource Efficiency and Fault Tolerance , 2009, 2009 NASA/ESA Conference on Adaptive Hardware and Systems.

[18]  Manfred Glesner,et al.  Deadlock-free routing and component placement for irregular mesh-based networks-on-chip , 2005, ICCAD-2005. IEEE/ACM International Conference on Computer-Aided Design, 2005..

[19]  Axel Jantsch,et al.  Methods for fault tolerance in networks-on-chip , 2013, CSUR.

[20]  Jürgen Teich,et al.  Invasive Computing: An Overview , 2011, Multiprocessor System-on-Chip.

[21]  Jürgen Teich,et al.  Symbolic parallelization of loop programs for massively parallel processor arrays , 2013, 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors.

[22]  Howard Jay Siegel,et al.  OE+IOE: A novel turn model based fault tolerant routing scheme for networks-on-chip , 2010, 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[23]  David Cunningham,et al.  Resilient X10: efficient failure-aware programming , 2014, PPoPP '14.

[24]  Vivek Sarkar,et al.  X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.

[25]  Alan D. George,et al.  Reconfigurable Fault Tolerance: A Comprehensive Framework for Reliable and Adaptive FPGA-Based Space Computing , 2012, TRETS.

[26]  Jürgen Teich,et al.  Adaptive fault tolerance through invasive computing , 2015, 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS).

[27]  Jürgen Teich,et al.  Partitioning of processor arrays: a piecewise regular approach , 1993, Integr..

[28]  Cristiana Bolchini,et al.  A software methodology for detecting hardware faults in VLIW data paths , 2001, Proceedings 2001 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems.

[29]  Heinz Gall Functional safety IEC 61508 / IEC 61511 the impact to certification and the user , 2008, 2008 IEEE/ACS International Conference on Computer Systems and Applications.

[30]  Jürgen Teich,et al.  Invasive Algorithms and Architectures Invasive Algorithmen und Architekturen , 2008, it Inf. Technol..

[31]  Jürgen Teich,et al.  Efficient Evaluation of Power/Area/Latency Design Trade-Offs for Coarse-Grained Reconfigurable Processor Arrays , 2011, J. Low Power Electron..

[32]  Dionisios N. Pnevmatikatos,et al.  The DeSyRe Runtime Support for Fault-Tolerant Embedded MPSoCs , 2014, 2014 IEEE International Symposium on Parallel and Distributed Processing with Applications.

[33]  Jürgen Teich,et al.  A co-design approach for fault-tolerant loop execution on Coarse-Grained Reconfigurable Arrays , 2015, 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS).

[34]  Ching-Te Chiu,et al.  On the design and analysis of fault tolerant NoC architecture using spare routers , 2011, 16th Asia and South Pacific Design Automation Conference (ASP-DAC 2011).

[35]  Cristiana Bolchini,et al.  Reliability-Driven System-Level Synthesis for Mixed-Critical Embedded Systems , 2013, IEEE Transactions on Computers.

[36]  J. Teich,et al.  Scheduling of Partitioned Regular Algorithms on Processor Arravs with , 1996 .

[37]  Syed M. A. H. Jafri,et al.  Design of a Fault-Tolerant Coarse-Grained Reconfigurable Architecture : A Case Study , 2010 .