AWAIT: An Ultra-Lightweight Soft-Error Mitigation Mechanism for Network-on-Chip Links

Networks-on-Chip have become a widely accepted communication paradigm for many-core Systems-on-Chip. However, as transistor sizes continue to shrink, the network's sensitivity to transient faults on its physical links can no longer be ignored: even a single transient fault can lead to network-wide congestion and system failure. This paper proposes AWAIT, an ultra-lightweight transient-fault mitigation mechanism for Network-on-Chip links. The proposed mechanism covers all single-event transients. Experimental results show that AWAIT prevents network-wide failure even in harsh environments (up to 80 million random link faults per second). The mechanism is also scalable, imposing only 5.1% area overhead and negligible critical-path delay overhead.
