On providing scalable self-healing adaptive fault-tolerance to RTR SoCs

The dependability of heterogeneous many-core FPGA-based systems is threatened by higher failure rates caused by disruptive scales of integration, increased design complexity, and radiation sensitivity. Triple-modular redundancy (TMR) and run-time reconfiguration (RTR) are traditional fault-tolerant (FT) techniques used to increase dependability. However, hardware redundancy is expensive, and most approaches have poor scalability, flexibility, and programmability. Innovative solutions are therefore needed to reduce the cost of redundancy while preserving acceptable levels of dependability. In this context, this paper presents the implementation of a self-healing adaptive fault-tolerant SoC that reuses RTR IP-cores to self-assemble different TMR schemes at run-time. The presented system demonstrates the feasibility of the Upset-Fault-Observer concept, a run-time self-test and recovery strategy that provides fault tolerance for functions accelerated in RTR cores while reducing the cost of scaling redundancy by running periodic reconfigurable TMR scan-cycles. In addition, this paper experimentally evaluates the trade-offs of the implemented reconfigurable TMR schemes by characterizing important fault-tolerance metrics, i.e., recovery time (self-repair and self-replicate), detection latency, self-assembly latency, throughput reduction, and increase in physical resources.
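The idea of a periodic TMR scan-cycle can be illustrated with a small software model. The sketch below is not the paper's implementation: it assumes that each core is modeled as a pair of a fault-free reference function and the function it is actually running, and that two shared spare replicas are temporarily loaded with the reference function to form a TMR group with the core under test. The names `majority_vote`, `scan_cycle`, `double`, and `stuck_bit` are hypothetical.

```python
from typing import Callable, List, Tuple

# Assumption: a core is modeled as (reference function, function actually running).
Core = Tuple[Callable[[int], int], Callable[[int], int]]


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def scan_cycle(cores: List[Core], test_input: int) -> List[int]:
    """Visit each core once and return the indices of cores that disagree
    with the voted output of their temporary TMR group."""
    faulty = []
    for idx, (reference, running) in enumerate(cores):
        # Two shared spares are (hypothetically) reconfigured with the
        # reference function, so only one extra pair of replicas is needed
        # regardless of how many cores are scanned.
        outputs = (running(test_input), reference(test_input), reference(test_input))
        voted = majority_vote(*outputs)
        if outputs[0] != voted:
            faulty.append(idx)  # candidate for repair or replication
    return faulty


def double(x: int) -> int:
    """Fault-free example function."""
    return (x * 2) & 0xFF


def stuck_bit(x: int) -> int:
    """Same function with one output bit flipped, modeling an upset."""
    return double(x) ^ 0x04


cores = [(double, double), (double, stuck_bit), (double, double)]
faulty_cores = scan_cycle(cores, 5)
```

In this model the redundancy cost stays at two spare replicas however many cores are scanned, at the price of a detection latency that grows with the length of the scan-cycle.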
